Below is a discussion of small study effects in the data for the 2017 PNAS article, "Meta-analysis of field experiments shows no change in racial discrimination in hiring over time", by Lincoln Quillian, Devah Pager, Ole Hexel, and Arnfinn Midtbøen. The first part is the initial analysis that I sent to Dr. Quillian. The Quillian et al. team replied here, also available via this link a level up. I responded to this reply below my initial analysis and will notify Dr. Quillian of the reply. Please note that Quillian et al. 2017 mentions publication bias analyses on page 5 of its main text and in Section 5 of the supporting information appendix.

---

Initial analysis

A conclusion of the Quillian et al. 2017 PNAS article is that levels of discrimination against Black job applicants in the United States have not changed much or at all over the past 25 years, based on a meta-analysis that focuses on 1989-2015 field experiments assessing discrimination against Black or Hispanic job applicants relative to White applicants. The credibility of this conclusion depends at least on the meta-analysis including the population of relevant field experiments or a representative set of relevant field experiments. However, the graph below for the set of Black/White discrimination field experiments in the dataset is consistent with what would be expected if the meta-analysis did not have a complete set of studies.

Comment Q2017 Figure 1

The graphs plot a measure of the precision of each study against the corresponding effect size estimate, from the dmap_update_1024recoded_3.dta dataset available here. For a population of studies or for a representative set of studies, the pattern of points is expected to approximate a symmetric pyramid peaking at zero on the y-axis. The logic of this expectation is that, if there were a single true underlying effect, the size of that effect would be the estimated effect size from a perfectly-precise study, which would have a standard error of zero. The average effect size for less-than-perfectly-precise studies should also approximate the true effect size, but any given less-than-perfectly-precise study would not necessarily produce an estimate of the true effect size and would be expected to produce estimates that often fall to one side or the other side of the true effect size, with estimates from lower-precision studies falling further on average from the true effect size than estimates from higher-precision studies, thus creating the expected symmetric pyramid shape.

Egger's test assesses asymmetry in the shape of a pattern of points. The p-value of 0.003 for the Black/White set of studies indicates the presence of sufficient evidence to conclude with reasonable certainty that the pattern of points for the 1989-2015 set of Black/White discrimination field experiments is asymmetric. This particular pattern of asymmetry could have been caused by the higher-precision studies having tested for discrimination in situations with lower levels of anti-Black discrimination relative to situations for the lower-precision studies. But this pattern could also have been produced by suppression of low-precision studies that had null results or had results that indicated discrimination favoring Blacks relative to Whites.
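As a rough illustration of the regression behind Egger's test, the sketch below regresses each study's standardized effect (the estimate divided by its standard error) on its precision (one divided by the standard error); an intercept far from zero signals funnel plot asymmetry. The function name and the toy numbers are hypothetical and are not taken from the Quillian et al. dataset, and the final step of converting the t statistic to a p-value (a t distribution with n-2 degrees of freedom) is left out.

```python
from math import sqrt

def egger_intercept(effects, std_errors):
    """Return (intercept, t statistic) from an Egger-style regression.

    Fits standardized effect = intercept + slope * precision by
    ordinary least squares, where standardized effect = effect / se
    and precision = 1 / se.
    """
    z = [y / s for y, s in zip(effects, std_errors)]
    x = [1.0 / s for s in std_errors]
    n = len(z)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    # Residual variance and the usual OLS standard error of the intercept.
    resid = [zi - intercept - slope * xi for xi, zi in zip(x, z)]
    sigma2 = sum(r * r for r in resid) / (n - 2)
    se_intercept = sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return intercept, intercept / se_intercept

# Hypothetical toy data in which less precise studies report larger effects.
toy_se = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40]
toy_effects = [0.22, 0.33, 0.30, 0.45, 0.52, 0.75]
toy_intercept, toy_t = egger_intercept(toy_effects, toy_se)
print(round(toy_intercept, 2), round(toy_t, 2))
```

In a set of studies scattered symmetrically around a single effect, the intercept should be close to zero.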

Any inference from analyses of the set of 1989-2015 Black/White discrimination field experiments should thus consider the possibility that the set is incomplete and that any such incompleteness might bias inferences. For example, assessing patterns over time without any adjustment for possible missing studies requires an assumption that the inclusion of any missing studies would not alter the particular inference being made. That might be a reasonable assumption, but it should be identified as an assumption of any such inference.

The graphs below attempt to assess this assumption, by plotting estimates for the 10 earliest 1989-2015 Black/White field experiments and the 10 latest 1989-2015 Black/White field experiments, excluding the study that had no year indicated in the dataset for the year of the fieldwork. Both graphs are at least suggestive of the same type of small study effects.

Comment Q2017 Figure 2

Statistical methods have been developed to estimate the true effect size in meta-analyses after accounting for the possibility that the meta-analysis does not include the population of relevant studies or at least a representative set of relevant studies. For example, the top 10 percent by precision method, the trim-and-fill method with a linear estimator, and the PET-PEESE method cut the estimate of discrimination across the Black/White discrimination field experiments from 36 percent fewer callbacks or interviews to 25 percent, 21 percent, and 20 percent, respectively. These estimates, though, depend heavily on a lack of publication bias in highly-precise studies, which adds another assumption to these analyses and underscores the importance of preregistering studies.
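For readers unfamiliar with PET-PEESE, a minimal sketch follows. PET regresses the effect estimates on their standard errors, PEESE regresses them on the squared standard errors, both with inverse-variance weights, and the intercept is read as the estimated effect for a hypothetical study with a standard error of zero. The helper names and toy numbers are hypothetical; this is not the code used for the estimates reported above, and it omits the conventional rule for choosing between the PET and PEESE estimates.

```python
def wls_intercept(y, x, weights):
    """Intercept from a weighted least squares fit of y on one predictor x."""
    w_sum = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))
    return mean_y - (sxy / sxx) * mean_x

def pet_estimate(effects, std_errors):
    """PET: effect regressed on the standard error, weights 1/se^2."""
    weights = [1.0 / s ** 2 for s in std_errors]
    return wls_intercept(effects, std_errors, weights)

def peese_estimate(effects, std_errors):
    """PEESE: effect regressed on the squared standard error, weights 1/se^2."""
    weights = [1.0 / s ** 2 for s in std_errors]
    return wls_intercept(effects, [s ** 2 for s in std_errors], weights)

# Hypothetical toy data: estimates drift upward as standard errors grow.
toy_se = [0.05, 0.10, 0.20, 0.40]
toy_effects = [0.25, 0.33, 0.52, 0.78]
print(round(pet_estimate(toy_effects, toy_se), 3))
print(round(peese_estimate(toy_effects, toy_se), 3))
```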

Social science should inform public beliefs and public policy, but the ability of social scientists to not report data that have been collected and analyzed cannot help but undercut this important role for social science. Social scientists should consider preregistering their plans to conduct studies and their planned research designs for analyzing data, to restrict their ability to suppress undesired results and to thus add credibility to their research and to social science in general.

---

Reply from the Quillian et al. team

Here

---

My response to the Quillian et al. reply

[1] The second section heading in the Quillian et al. reply correctly states that "Tests based on funnel plot asymmetry often generate false positives as indicators of publication bias". The Quillian et al. reply reported the funnel plot to the left below and the Egger's test p-value of 0.647 for the set of 13 Black/White discrimination resume audit correspondence field experiments, which provide little-to-no evidence of small study effects or publication bias. However, the funnel plot of the residual set of 8 Black/White discrimination field experiments—of in-person-audits—has an asymmetric shape and a p=0.043 Egger's test indicative of small study effects.

Comment Q2017 Figure 3

The Quillian et al. reply indicated that "Using only resume audits to analyze change over time gives no trend (the linear slope is -.002, almost perfectly flat, shown in figure 3 in our original paper, and the weighted-average discrimination ratio is 1.32, only slightly below the ratio of all studies of 1.36)". For me at least, the lack of a temporal pattern in the resume audit (correspondence) field experiments is more convincing after seeing the funnel plot pattern than when not knowing the funnel plot pattern, although now the inference is limited to racial discrimination between 2001 and 2015 because there were no dataset correspondence field experiments conducted between 1989 and 2000. The top graph below illustrates this nearly-flat -0.002 slope for correspondence audit field experiments. Presuming no publication bias or presuming a constant effect of publication bias, it is reasonable to infer that there was no decrease in the level of White-over-Black favoring in correspondence audit field experiments between 2001 and 2015.

Comment Q2017 Figure 4

But presuming no publication bias or presuming a constant effect of publication bias, the slope for in-person audits in the bottom graph above indicates a potentially alarming increase in discrimination favoring Whites over Blacks, from the early 1990s to the post-2000 years, with a slope of 0.03 and a corresponding p-value of p=0.08. But maybe there's a good reason to not include the three field experiments from 1990 and 1991, given the decade gap between the latest of these three field experiments and the set of post-2000 field experiments. If so, the slope of the line for Black/White discrimination correspondence studies and Black/White discrimination in-person audit studies pooled together from 2001 to 2015 is -0.02 with a p-value of p=0.059, and is depicted below.

[2] I don't object to the use of the publication bias test reported on in Quillian et al. 2017. My main objections are to the non-reporting of a funnel plot and to basing the inference that "publication or write-up bias is unlikely to have produced inflated discrimination estimates" (p. 6 of the supporting information appendix) on a null result from a regression with 21 points and five independent variables. Trim-and-fill lowered the meta-analysis estimate from 0.274 to 0.263 for the 1989-2015 Black/White discrimination correspondence audits, but lowered the 1989-2015 Black/White discrimination in-person audit meta-analysis estimate from 0.421 to 0.158. The trim-and-fill decrease for the pooled set of 1989-2015 Black/White discrimination field experiments is from 0.307 to 0.192.

Funnel plots and corresponding tests of funnel plot asymmetry indicate at most the presence of small study effects, which could be caused by phenomena other than publication bias. The Quillian et al. reply notes that "we find evidence that the difference between in person versus resume audit may create false positives for this test" (p. 4). This information and the reprinted funnel plots below are useful because they suggest multiple reasons to not pool results from in-person audits and correspondence audits for Black/White discrimination, such as [i] the possibility of publication bias in the in-person audit set of studies or [ii] possible differences in mean effect sizes for in-person audits compared to correspondence audits.

Comment Q2017 Figure 3

Maybe the best way to report these results is a flat line for correspondence audits indicating no change between 2001 and 2015 (N=13) and a downward-sloping-but-not-statistically-significant line for in-person audits between 2001 and 2015 (N=5), with an upward-sloping-but-not-statistically-significant line for in-person audits between 1989 and 2015 (N=8).

[3] This section discusses the publication bias test used by Quillian et al. 2017. I'll use "available" to describe field experiments retrieved in the search for published and unpublished field experiments.

The Quillian et al. reply (pp. 1-2) describes the logic of the publication bias test that they used as:

If publication bias is a serious issue, then studies that focus on factors other than race/ethnic discrimination should show lower discrimination than studies focused primarily on race/ethnicity, because for the latter studies (but not the former) publication should be difficult for studies that do not find significant evidence of racial discrimination.

The expectation, as I understand it, is that discrimination field experiments with race as the primary focus will have a range of estimates, some of which are statistically significant and some of which are not statistically significant. If there is publication bias such that race-as-the-primary-focus field experiments that do not find discrimination against Blacks are less likely to be available than race-as-the-primary-focus field experiments that find discrimination against Blacks, then the estimate of discrimination against Blacks in the available race-as-the-primary-focus field experiments should be artificially inflated above the true value of racial discrimination. This publication bias test involves a comparison of this presumed inflated effect size to the effect size from field experiments in which race was not the primary focus, which presumably is closer to the true value of racial discrimination because non-availability in the non-race-as-the-primary-focus field experiments is not primarily due to the p-value and direction for racial discrimination but is instead or primarily due to the p-value and direction for the other type of discrimination. The publication bias test is whether the effect size for the available non-race-focused discrimination field experiments is smaller than the effect size for the available race-focused discrimination field experiments.

The effect size for racial discrimination from field experiments in which race was not the primary focus might still be inflated in the presence of publication bias because [non-race-as-the-primary-focus field experiments that don't find discrimination in the primary focus but do find discrimination in the race manipulation] are plausibly more likely to be available than [non-race-as-the-primary-focus field experiments that don't find discrimination in the primary focus or in the race manipulation].

But let's stipulate that the racial discrimination effect size from non-race-as-the-primary-focus field experiments should be smaller than the racial discrimination effect size from race-as-the-primary-focus field experiments. If so, how large must this expected difference be such that the observed null result (0.051 coefficient, 0.112 standard error) in the N=21 five-independent-variable regression in Table S7 of Quillian et al. 2017 should be interpreted as evidence of the absence of nontrivial levels of publication bias?
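One way to make that question concrete is to look at the interval implied by the reported Table S7 coefficient and standard error. The sketch below uses a normal approximation, which is an assumption here rather than something taken from Quillian et al. 2017; a t-based interval with the few degrees of freedom left in a 21-observation regression would be wider. The function name is hypothetical.

```python
from statistics import NormalDist

def normal_ci(estimate, std_error, level=0.95):
    """Two-sided confidence interval under a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

# Coefficient and standard error quoted above from Table S7.
low, high = normal_ci(0.051, 0.112)
print(round(low, 2), round(high, 2))
```

Under that approximation, the interval for the Table S7 coefficient runs from about -0.17 to 0.27.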

For what it's worth, the publication bias test in the regression below reflects the test used in Quillian et al. 2017, but with a different model and with removal of the three field experiments from 1990 and 1991, such that the sample is the set of Black/White discrimination field experiments from 2001 to 2015. The control for the study method indicates that in-person audits have an estimated 0.40 larger effect size than correspondence audits. The 95 percent confidence interval for the race_not_focus predictor ranges from -0.21 to 0.18. Is that range inconsistent with the expected value based on this test if there were nontrivial amounts of publication bias?

Comment Q2017 Figure 6

---

Data available at the webpage for Quillian et al. 2017 [here]

My R code [here]

My Stata code [here]


One notable finding in the racial discrimination literature is the boomerang/backlash effect reported in Peffley and Hurwitz 2007:

"...whereas 36% of whites strongly favor the death penalty in the baseline condition, 52% strongly favor it when presented with the argument that the policy is racially unfair" (p. 1001).

The racially-unfair argument shown to participants was: "[Some people say/FBI statistics show] that the death penalty is unfair because most of the people who are executed are African Americans" (p. 1002). Statistics reported in Peffley and Hurwitz 2007 Table 1 indicate that responses differed at p<=0.05 for Whites in the baseline no-argument condition compared to Whites in the argument condition.

However, the boomerang/backlash effect did not appear at p<=0.05 in large-N MTurk direct and conceptual replication attempts reported on in Butler et al. 2017 or in my analysis of a nearly-direct replication attempt using a large-N sample of non-Hispanic Whites in a TESS study by Spencer Piston and Ashley Jardina with data collection by GfK, with a similar null result for a similar racial-bias-argument experiment regarding three strikes laws.

For the weighted TESS data, on a scale from 0 for strongly oppose to 1 for strongly favor, support for the death penalty for persons convicted of murder was 0.015 units lower (p=0.313, n=2018) in the condition in which participants were told "Some people say that the death penalty is unfair because most of the people who are executed are black", compared to the condition in which participants did not receive that statement, with controls for the main experimental conditions for the TESS study, which appeared earlier in the survey. This lack of statistical significance remained when the weighted sample was limited to liberals and extreme liberals; slight liberals, liberals, and extreme liberals; conservatives and extreme conservatives; and slight conservatives, conservatives, and extreme conservatives. There was also no statistically-significant difference between conditions in my analysis of the unweighted data. Regarding missing data, 7 of 1,034 participants in the control condition and 9 of 1,000 participants in the experimental condition did not provide a response.

Moreover, in the prior item on the survey, on a 0-to-1 scale, responses were 0.013 units higher (p=0.403, n=2025) for favoring three strikes laws in the condition in which participants were told that "...critics argue that these laws are unfair because they are especially likely to affect black people", compared to the condition in which participants did not receive that statement, with controls for the main experimental conditions for the TESS study, which appeared earlier in the survey. This lack of statistical significance remained when the weighted sample was limited to liberals and extreme liberals; slight liberals, liberals, and extreme liberals; conservatives and extreme conservatives; and slight conservatives, conservatives, and extreme conservatives. There was also no statistically-significant difference between conditions in my analysis of the unweighted data. Regarding missing data, 6 of 986 participants in the control condition and 3 of 1,048 participants in the experimental condition did not provide a response.

Null results might be attributable to participants not paying attention, so it is worth noting that the main treatment in the TESS experiment was that participants in one of the three conditions were given a passage to read entitled "Genes May Cause Racial Difference in Heart Disease" and participants in another of the three conditions were given a passage to read entitled "Social Conditions May Cause Racial Difference in Heart Disease". There was a statistically-significant difference between these conditions in responses to an item about whether there are biological differences between blacks and whites (p=0.008, n=2,006), with responses in the Genes condition indicating greater estimates of biological differences between blacks and whites.

---

NOTE:

Data for the TESS study are available here. My Stata code is available here.


Continuing from a Twitter thread that currently ended here...

Hi Jenn,

I don't think that it's disingenuous to compare two passages that assess discrimination in decision-making based on models of decision-making that lack measures of relevant non-discriminatory factors that could influence decisions. At that level of abstraction, the two passages are directly comparable.

My perception is that:

The evidence of discrimination against Asian Americans in the cited study about college admissions is stronger than the evidence of discrimination against Asian Americans in the cited study about earnings; therefore, not accepting the evidence of discrimination in the college admissions study as evidence of true discrimination suggests that the evidence of discrimination in the earnings study should also not be accepted as evidence of true discrimination.

I perceive the evidence of discrimination in the college admissions study to be stronger because [1] net of included controls, the college admissions gap appears to be larger than the earnings gap, [2] the college admissions study appears to have fewer inferential issues, and fewer important ones, involving samples and included controls [*], and [3] compared to decision-making about which applicants are admitted to a college, decision-making about how much a worker should be paid presumably involves more important information about relevant non-discriminatory factors that have not been included in the statistical control of the studies.

Moreover, including evidence from outside these studies, legal cases involving racial discrimination in college admissions have often involved decision-making that explicitly includes race as a factor. My presumption is that a larger percentage of recent college admissions decisions have been made in which race is an explicit factor in admissions compared to the percentage of recent earnings decisions that have been made in which race is an explicit factor in worker remuneration.

For what it's worth, I think that a residual net racial discrimination is likely across a large number of important decisions made in the absence of perfect information, such as decisions involving college admissions and earnings, and I think that it is reasonable to accept evidence of discrimination against Asian Americans based on the studies cited in both passages.

---

[*] Support for [2] above:

[2a] The study that reported an 8% earnings gap was limited to data for men age 25 to 64 with a college degree who were participating in the labor market. Estimates for comparing earnings of White men to earnings of Asian men should be expected to be skewed to the extent that White men and Asian men with the same earnings potential have a different probability of being a college graduate or have a different probability of being in the labor market.

[2b] I don't think that naively controlling for cost of living is correct because higher costs of living partly reflect job perks that should not be completely controlled for. If, after adjusting for cost of living, a person who works in San Francisco has the same equivalent earnings as a person who works in an uncomfortably-humid rural lower-cost-of-living area with few amenities, the person who works in San Francisco is nonetheless better off in terms of climate and access to amenities.

---

I'm not sure that selectivity in immigration is relevant. The earnings models control for factors such as highest degree, field of study for the highest degree, and Carnegie classification of the school for the highest degree. It's possible that, net of these controls, Asian American men workers have higher earnings potential than White American men workers, but I'm not aware of evidence for this.


Continuing from this Twitter thread...

Hi Logan,

1. I do not dispute the claimed correlation between White Southerners' racial attitudes and support for the Confederate battle flag, but the Wright and Esses 2017 analysis suggests an important causal claim that a meaningfully-large percentage of White Southerners support use of the Confederate battle flag for reasons unrelated to racial animus. Is there evidence that that causal claim is not correct?

2. I think that a "'heritage' doesn't tell us much" claim should be based on the performance of measures of pride in Southern heritage. Civil War knowledge and linked fate with Southerners are not measures of pride, so these measures cannot support a claim about the low explanatory power of pride.

3. Could you articulate what is inadequate about the Wright and Esses racial attitude measures? Given the results that Carney and Enos reported here, racial resentment does not appear to be an adequate measure of racial attitudes.

The Monkey Cage published a post by Dawn Langan Teele and Kathleen Thelen: "Some of the top political science journals are biased against women. Here's the evidence." The evidence presented for the claim of bias appears to be that women represent a larger percentage of the political science discipline than of authors in top political science journals. But that doesn't mean that the journals are biased against women, and the available data that I am aware of also doesn't indicate that the journals are biased against women:

1. Discussing data from World Politics (1999-2004), International Organization (2002), and Comparative Political Studies and International Studies Quarterly (three undisclosed years), Breuning and Sanders 2007 reported that "women fare comparatively well and appear in each journal at somewhat higher rates than their proportion among submitting authors" (p. 350).

2. Data for the American Journal of Political Science reported by Rick Wilson here indicated that 32% of submissions from 2010 to 2013 had at least one female author and 35% of accepted articles had at least one female author.

3. Based on data from 1983 to 2008 in the Journal of Peace Research, Østby et al. 2013 reported that: "If anything, female authors are more likely to be selected for publication [in JPR]".

4. Data below from Ishiyama 2017 for the American Political Science Review from 2012 to 2016 indicate that women served as first author for 27% of submitted manuscripts and 25% of accepted manuscripts.

APSR Data

---

The data across the four points above do not indicate that these journals or corresponding peer reviewers are biased against women in this naive analysis. Of course, causal identification of bias would require a more representative sample beyond the largely volunteered data above and would require, for claims of bias among peer reviewers, statistical control for the quality of submissions and, for claims of bias at the editor level, statistical control for peer reviewer recommendations. Analyses would get even more complicated when accounting for the possibility that editor bias can influence peer reviewer selection, which can make the process easier or more difficult than would occur with unbiased assignment to peer reviewers.

Please let me know if you are aware of any other relevant data for political science journals.

---

NOTE

1 The authors of the Monkey Cage post have an article that cites Breuning and Sanders 2007 and Østby et al. 2013, but these data were not mentioned in the Monkey Cage post.


Based on a sample of undergraduate students at a university in Texas, Anderson et al. 2009 reported (p. 216) that:

Contrary to popular beliefs, feminists reported lower levels of hostility toward men than did nonfeminists.

But this stereotype-inconsistent pattern was based on a coding of "feminist" that reflected whether a participant had defined "feminist" "in a way consistent with our operational definition of feminism" (p. 220) and not based on whether the participant self-identified as a feminist, a self-identification for which the researchers had data.

---

I assessed claims about self-identified feminists' views of men using data from the ANES 2016 Time Series Survey national sample. My first predictor was a dichotomous measure of sex, coded 1 for female and 0 for male. My second predictor was self-identified feminist, coded as 1 for a participant who identified as a feminist or strong feminist in variable V161345.

The best available dataset measures to construct a measure of negative attitudes toward men were measures of perceived levels of discrimination against men and women in the United States (V162363 and V162362, respectively). I coded participants as 1 in a dichotomous variable if the participant indicated "none at all" for the amount of discrimination against men in the United States but indicated a nonzero level of discrimination against women in the United States. Denial of discrimination is a plausible measure of negative attitudes toward a group that faces discrimination, and there is statistical evidence that men in the United States face discrimination in areas such as criminal sentencing (e.g., Doerner 2012 and Starr 2015); moreover, men are formally excluded from certain opportunities, such as opportunities at the NSF-funded Visions in Methodology conference.

---

In weighted regressions, 37% of nonfeminist women reported no discrimination against men and a nonzero level of discrimination against women, compared to 46% of feminist women, with a p-value of p=0.002 for the 9 percentage-point difference. However, the gap between feminist men and nonfeminist men was 20 percentage points, with 28% of nonfeminist men reporting no discrimination against men and a nonzero level of discrimination against women, compared to 48% of feminist men, with a p-value less than 0.001 for the difference. Feminist identification was thus associated with an 11 percentage-point larger difference in anti-male attitudes for men than for women, with a p-value for the difference of p=0.012.
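The 11 percentage-point figure is a difference in differences of the four group percentages reported above. In a linear regression with only sex, feminist identification, and their product as predictors (an assumption about the model, made here for illustration), the interaction coefficient equals that difference in differences, with its sign depending on how sex is coded. A short sketch of the arithmetic, using the rounded percentages from the text, so the result can differ slightly from the model output:

```python
# Percent reporting no discrimination against men but some against women,
# as rounded in the text above.
rates = {
    ("women", "nonfeminist"): 37,
    ("women", "feminist"): 46,
    ("men", "nonfeminist"): 28,
    ("men", "feminist"): 48,
}

# Feminist-minus-nonfeminist gap within each sex.
gap_women = rates[("women", "feminist")] - rates[("women", "nonfeminist")]
gap_men = rates[("men", "feminist")] - rates[("men", "nonfeminist")]

# Difference in differences: how much larger the gap is among men.
difference_in_gaps = gap_men - gap_women
print(gap_women, gap_men, difference_in_gaps)
```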

Output for the interaction model is below:

denialDM

---

NOTES

1. My Stata code is here. ANES 2016 Time Series Study data is available here.

2. The denialDM output variable is dichotomous, but estimates and inferences do not change if logit is used instead of linear regression.

3. The dataset has another question (V161346) that asked participants how well "feminist" described them, on a 5-point scale (extremely well, very well, somewhat well, not very well, and not at all); inferences are the same using that measure. Inferences are also the same using V161345 to make a 3-part feminist measure coded from non-feminist to strong feminist. See the Stata code.

4. Hat tip to Nathaniel Bechhofer, who retweeted this tweet, which led to this post.


I had a recent Twitter exchange about a Monkey Cage post:

Below, I use statistical power calculations to explain why the Ahlquist et al. paper, or at least the list experiment analysis cited in the Monkey Cage post, is not compelling.

---

Discussing the paper (published version here), Henry Farrell wrote:

So in short, this research provides exactly as much evidence supporting the claim that millions of people are being kidnapped by space aliens to conduct personally invasive experiments on, as it does to support Trump's claim that millions of people are engaging in voter fraud.

However, a survey with a sample size of three would also not be able to differentiate the percentage of U.S. residents who commit vote fraud from the percentage of U.S. residents abducted by aliens. For studies that produce a null result, it is necessary to assess the ability of the study to detect an effect of a particular size, to get a sense of how informative that null result is.

The Ahlquist et al. paper has a footnote [31] that can be used to estimate the statistical power for their list experiments: more than 260,000 total participants would be needed for a list experiment to have 80% power to detect a 1 percentage point difference between treatment and control groups, using an alpha of 0.05. The power calculator here indicates that the corresponding estimated standard deviation is at least 0.91 [see note 1 below].

So let's assume that list experiment participants are truthful and that we combine the 1,000 participants from the first Ahlquist et al. list experiment with the 3,000 participants from the second Ahlquist et al. list experiment, so that we'd have 2,000 participants in the control sample and 2,000 participants in the treatment sample. Statistical power calculations using an alpha of 0.05 and a standard deviation of 0.91 indicate that there is:

  • a 5 percent chance of detecting a 1% rate of vote fraud.
  • an 18 percent chance of detecting a 3% rate of vote fraud.
  • a 41 percent chance of detecting a 5% rate of vote fraud.
  • a 79 percent chance of detecting an 8% rate of vote fraud.
  • a 94 percent chance of detecting a 10% rate of vote fraud.
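The percentages in the list above can be approximated with a few lines of code. The sketch below assumes that the power calculation is a two-sided z-test comparing two group means with a common standard deviation, counting only rejections in the direction of the true difference; that formula is an assumption about what the linked calculator computes, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(diff, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two means.

    Counts only rejections in the direction of the true difference.
    """
    std_normal = NormalDist()
    # Standard error of the difference between two group means.
    se = sigma * sqrt(2.0 / n_per_group)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(abs(diff) / se - z_crit)

# 2,000 participants per group and a standard deviation of 0.91.
for rate in (0.01, 0.03, 0.05, 0.08, 0.10):
    print(f"{rate:.0%} rate: power = {power_two_sample(rate, 0.91, 2000):.2f}")

# The 260,000-participant scenario: 130,000 per group, 1 point difference.
print(f"260,000 total: power = {power_two_sample(0.01, 0.91, 130000):.2f}")
```

The same function can be applied to any other hypothesized rate of vote fraud by changing the first argument.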

---

Let's return to the claim that millions of U.S. residents committed vote fraud and use 5 million for the number of adult U.S. residents who committed vote fraud in the 2016 election, eliding the difference between illegal votes and illegal voters. There are roughly 234 million adult U.S. residents (reference), so 5 million vote fraudsters would be 2.1% of the adult population, and a 4,000-participant list experiment would have about an 11 percent chance of detecting that 2.1% rate of vote fraud.

Therefore, if 5 million adult U.S. residents really did commit vote fraud, a list experiment with the sample size of the pooled Ahlquist et al. 2014 list experiments would produce a statistically-significant detection of vote fraud about 1 of every 9 times the list experiment was conducted. The fact that Ahlquist et al. 2014 didn't detect voter impersonation at a statistically-significant level doesn't appear to compel any particular belief about whether the rate of voter impersonation in the United States is large enough to influence the outcome of presidential elections.

---

NOTES

1. Enter 0.00 for mu1, 0.01 for mu2, 0.91 for sigma, 0.05 for alpha, and a 130,000 sample size for each sample; then hit Calculate. The power will be 0.80.

2. I previously discussed the Ahlquist et al. list experiments here and here. The second link indicates that an Ahlquist et al. 2014 list experiment did detect evidence of attempted vote buying.
