Small study effects in a meta-analysis of racial discrimination field experiments
Below is a discussion of small study effects in the data for the 2017 PNAS article, "Meta-analysis of field experiments shows no change in racial discrimination in hiring over time", by Lincoln Quillian, Devah Pager, Ole Hexel, and Arnfinn Midtbøen. The first part is the initial analysis that I sent to Dr. Quillian. The Quillian et al. team replied here, also available via this link a level up. I responded to this reply below my initial analysis and will notify Dr. Quillian of the reply. Please note that Quillian et al. 2017 mentions publication bias analyses on page 5 of its main text and in Section 5 of the supporting information appendix.
---
Initial analysis
A conclusion of the Quillian et al. 2017 PNAS article is that levels of discrimination against Black job applicants in the United States have changed little or not at all over the past 25 years. That conclusion is based on a meta-analysis that focuses on 1989-2015 field experiments assessing discrimination against Black or Hispanic job applicants relative to White applicants. The credibility of this conclusion depends at least on the meta-analysis including the population of relevant field experiments or a representative set of relevant field experiments. However, the graph below for the set of Black/White discrimination field experiments is consistent with what would be expected if the meta-analysis did not have a complete set of studies.
The graphs plot a measure of the precision of each study against the corresponding effect size estimate, from the dmap_update_1024recoded_3.dta dataset available here. For a population of studies or for a representative set of studies, the pattern of points is expected to approximate a symmetric pyramid peaking at zero on the y-axis. The logic of this expectation is that, if there were a single true underlying effect, the size of that effect would be the estimated effect size from a perfectly-precise study, which would have a standard error of zero. The average effect size for less-than-perfectly-precise studies should also approximate the true effect size, but any given less-than-perfectly-precise study would not necessarily produce an estimate equal to the true effect size. Such estimates would often fall to one side or the other of the true effect size, with estimates from lower-precision studies falling further on average from the true effect size than estimates from higher-precision studies, thus creating the expected symmetric pyramid shape.
Egger's test assesses asymmetry in the shape of a pattern of points. The p-value of 0.003 for the Black/White set of studies is sufficient evidence to conclude with reasonable certainty that the pattern of points for the 1989-2015 set of Black/White discrimination field experiments is asymmetric. This particular pattern of asymmetry could have been caused by the higher-precision studies having tested for discrimination in situations with lower levels of anti-Black discrimination relative to situations for the lower-precision studies. But this pattern could also have been produced by suppression of low-precision studies that had null results or had results that indicated discrimination favoring Blacks relative to Whites.
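The logic of the funnel plot and of Egger's test can be illustrated with a minimal Python sketch on simulated data. Everything in it is an assumption for illustration: the true effect of 0.2, the range of standard errors, the suppression rule, and the function names are hypothetical and are not taken from the Quillian et al. dataset or from my R and Stata code. The sketch uses the regression form of Egger's test, in which each estimate divided by its standard error is regressed on precision (one over the standard error), and an intercept far from zero signals asymmetry.

```python
import math
import random


def ols(x, y):
    """Simple OLS of y on x; returns (intercept, slope, se_intercept)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)  # residual variance
    se_intercept = math.sqrt(s2 * (1 / n + mx ** 2 / sxx))
    return intercept, slope, se_intercept


def egger_intercept(effects, ses):
    """Egger regression: (effect / SE) on (1 / SE); returns (intercept, t-statistic)."""
    z = [e / s for e, s in zip(effects, ses)]
    precision = [1 / s for s in ses]
    intercept, _, se_intercept = ols(precision, z)
    return intercept, intercept / se_intercept


def simulate(n, true_effect, seed):
    """Hypothetical studies sharing one true effect, with varying standard errors."""
    rng = random.Random(seed)
    ses = [rng.uniform(0.05, 0.5) for _ in range(n)]
    effects = [rng.gauss(true_effect, s) for s in ses]
    return effects, ses


def suppress(effects, ses):
    """Assumed suppression rule: drop low-precision studies (SE > 0.2)
    unless their estimate is clearly positive (effect / SE >= 1.64)."""
    kept = [(e, s) for e, s in zip(effects, ses) if s <= 0.2 or e / s >= 1.64]
    return [e for e, _ in kept], [s for _, s in kept]


effects, ses = simulate(2000, 0.2, seed=1)
full_intercept, full_t = egger_intercept(effects, ses)
sup_intercept, sup_t = egger_intercept(*suppress(effects, ses))
print(f"complete set:   intercept={full_intercept:.3f}, t={full_t:.2f}")
print(f"suppressed set: intercept={sup_intercept:.3f}, t={sup_t:.2f}")
```

In this sketch, the complete simulated set should produce an intercept near zero, and the suppressed set should produce a clearly positive intercept. A positive intercept of this kind does not by itself distinguish suppression from real differences between the situations tested by lower-precision and higher-precision studies.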
Any inference from analyses of the set of 1989-2015 Black/White discrimination field experiments should thus consider the possibility that the set is incomplete and that any such incompleteness might bias inferences. For example, assessing patterns over time without any adjustment for possible missing studies requires an assumption that the inclusion of any missing studies would not alter the particular inference being made. That might be a reasonable assumption, but it should be identified as an assumption of any such inference.
The graphs below attempt to assess this assumption, by plotting estimates for the 10 earliest 1989-2015 Black/White field experiments and the 10 latest 1989-2015 Black/White field experiments, excluding the study for which the dataset indicated no fieldwork year. Both graphs are at least suggestive of the same type of small study effects.
Statistical methods have been developed to estimate the true effect size in meta-analyses after accounting for the possibility that the meta-analysis does not include the population of relevant studies or at least a representative set of relevant studies. For example, the top 10 percent by precision method, the trim-and-fill method with a linear estimator, and the PET-PEESE method cut the estimate of discrimination across the Black/White discrimination field experiments from 36 percent fewer callbacks or interviews to 25 percent, 21 percent, and 20 percent, respectively. These estimates, though, depend heavily on a lack of publication bias in highly-precise studies, which adds another assumption to these analyses and underscores the importance of preregistering studies.
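To make two of these adjustments concrete, here is a minimal Python sketch of the top 10 percent by precision estimate and the PET-PEESE estimate, run on made-up numbers. It is not a reproduction of the 25, 21, and 20 percent figures above, and the function names, the toy data, and the 1.96 cutoff (a normal approximation) are my assumptions for illustration. PET regresses the effect size on the standard error with inverse-variance weights and reads off the intercept as the predicted effect for a study with a standard error of zero. PEESE does the same with the squared standard error as the predictor. The conditional rule used here reports PEESE only if the PET intercept differs detectably from zero. The top 10 percent estimate below is an unweighted mean of the most precise tenth of studies; a weighted mean is another option.

```python
import math


def wls(x, y, w):
    """Weighted least squares of y on x; returns (intercept, slope, se_intercept)."""
    n = len(x)
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum(wi * (yi - intercept - slope * xi) ** 2
              for wi, xi, yi in zip(w, x, y))
    s2 = rss / (n - 2)  # weighted residual variance
    se_intercept = math.sqrt(s2 * (1 / sw + mx ** 2 / sxx))
    return intercept, slope, se_intercept


def pet(effects, ses):
    """PET: effect regressed on SE, weights 1/SE^2; returns (intercept, se)."""
    b0, _, se0 = wls(ses, effects, [1 / s ** 2 for s in ses])
    return b0, se0


def peese(effects, ses):
    """PEESE: effect regressed on SE^2, weights 1/SE^2; returns (intercept, se)."""
    b0, _, se0 = wls([s ** 2 for s in ses], effects, [1 / s ** 2 for s in ses])
    return b0, se0


def pet_peese(effects, ses, crit=1.96):
    """Conditional estimate: PEESE if the PET intercept is detectably nonzero.
    The 1.96 cutoff is an assumed normal-approximation threshold."""
    b0, se0 = pet(effects, ses)
    t = abs(b0) / se0 if se0 > 0 else math.inf
    return peese(effects, ses)[0] if t > crit else b0


def top_precision_mean(effects, ses, fraction=0.10):
    """Unweighted mean effect among the most precise `fraction` of studies."""
    ranked = sorted(zip(ses, effects))  # smallest standard error first
    k = max(1, round(fraction * len(ranked)))
    return sum(e for _, e in ranked[:k]) / k


# Toy data: 20 hypothetical studies whose estimates rise with the standard error.
ses = [0.05 * k for k in range(1, 21)]
effects = [0.2 + 1.5 * s for s in ses]
print("naive mean:      ", sum(effects) / len(effects))
print("top 10 percent:  ", top_precision_mean(effects, ses))
print("PET intercept:   ", pet(effects, ses)[0])
print("PET-PEESE:       ", pet_peese(effects, ses))
```

In the toy data the estimates were built to equal 0.2 plus 1.5 times the standard error, so the PET intercept recovers 0.2 while the naive mean is pulled upward by the low-precision studies. Real data would not be this tidy, and each method carries its own assumptions about how estimates relate to precision.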
Social science should inform public beliefs and public policy, but the ability of social scientists to not report data that have been collected and analyzed cannot help but undercut this important role for social science. Social scientists should consider preregistering their plans to conduct studies and their planned research designs for analyzing data, to restrict their ability to suppress undesired results and to thus add credibility to their research and to social science in general.
---
Reply from Quillian et al.
---
My response to the Quillian et al. reply
[1] The second section heading in the Quillian et al. reply correctly states that "Tests based on funnel plot asymmetry often generate false positives as indicators of publication bias". The Quillian et al. reply reported the funnel plot to the left below and the Egger's test p-value of 0.647 for the set of 13 Black/White discrimination resume audit correspondence field experiments, which provide little-to-no evidence of small study effects or publication bias. However, the funnel plot of the residual set of 8 Black/White discrimination field experiments (the in-person audits) has an asymmetric shape and an Egger's test p-value of 0.043, indicative of small study effects.
The Quillian et al. reply indicated that "Using only resume audits to analyze change over time gives no trend (the linear slope is -.002, almost perfectly flat, shown in figure 3 in our original paper, and the weighted-average discrimination ratio is 1.32, only slightly below the ratio of all studies of 1.36)". For me at least, the lack of a temporal pattern in the resume audit (correspondence) field experiments is more convincing after seeing the funnel plot pattern than when not knowing the funnel plot pattern, although now the inference is limited to racial discrimination between 2001 and 2015 because the dataset has no correspondence field experiments conducted between 1989 and 2000. The top graph below illustrates this nearly-flat -0.002 slope for correspondence audit field experiments. Presuming no publication bias or presuming a constant effect of publication bias, it is reasonable to infer that there was no decrease in the level of White-over-Black favoring in correspondence audit field experiments between 2001 and 2015.
But presuming no publication bias or presuming a constant effect of publication bias, the slope for in-person audits in the bottom graph above indicates a potentially alarming increase in discrimination favoring Whites over Blacks, from the early 1990s to the post-2000 years, with a slope of 0.03 and a corresponding p-value of 0.08. Maybe, though, there's a good reason to not include the three field experiments from 1990 and 1991, given the decade gap between the latest of these three field experiments and the set of post-2000 field experiments. If so, the slope of the line for Black/White discrimination correspondence studies and Black/White discrimination in-person audit studies pooled together from 2001 to 2015 is -0.02 with a p-value of 0.059, as depicted below.
[2] I don't object to the use of the publication bias test reported on in Quillian et al. 2017. My main objections are to the non-reporting of a funnel plot and to basing the inference that "publication or write-up bias is unlikely to have produced inflated discrimination estimates" (p. 6 of the supporting information appendix) on a null result from a regression with 21 points and five independent variables. Trim-and-fill lowered the meta-analysis estimate from 0.274 to 0.263 for the 1989-2015 Black/White discrimination correspondence audits, but lowered the 1989-2015 Black/White discrimination in-person audit meta-analysis estimate from 0.421 to 0.158. The trim-and-fill decrease for the pooled set of 1989-2015 Black/White discrimination field experiments is from 0.307 to 0.192.
Funnel plots and corresponding tests of funnel plot asymmetry indicate at most the presence of small study effects, which could be caused by phenomena other than publication bias. The Quillian et al. reply notes that "we find evidence that the difference between in person versus resume audit may create false positives for this test" (p. 4). This information and the reprinted funnel plots below are useful because they suggest multiple reasons to not pool results from in-person audits and correspondence audits for Black/White discrimination, such as [i] the possibility of publication bias in the in-person audit set of studies or [ii] possible differences in mean effect sizes for in-person audits compared to correspondence audits.
Maybe the best way to report these results is a flat line for correspondence audits indicating no change between 2001 and 2015 (N=13) and a downward-sloping-but-not-statistically-significant line for in-person audits between 2001 and 2015 (N=5), with an upward-sloping-but-not-statistically-significant line for in-person audits between 1989 and 2015 (N=8).
[3] This section discusses the publication bias test used by Quillian et al. 2017. I'll use "available" to describe field experiments retrieved in the search for published and unpublished field experiments.
The Quillian et al. reply (pp. 1-2) describes the logic of the publication bias test that they used as:
If publication bias is a serious issue, then studies that focus on factors other than race/ethnic discrimination should show lower discrimination than studies focused primarily on race/ethnicity, because for the latter studies (but not the former) publication should be difficult for studies that do not find significant evidence of racial discrimination.
The expectation, as I understand it, is that discrimination field experiments with race as the primary focus will have a range of estimates, some of which are statistically significant and some of which are not statistically significant. If there is publication bias such that race-as-the-primary-focus field experiments that do not find discrimination against Blacks are less likely to be available than race-as-the-primary-focus field experiments that find discrimination against Blacks, then the estimate of discrimination against Blacks in the available race-as-the-primary-focus field experiments should be artificially inflated above the true value of racial discrimination. This publication bias test involves a comparison of this presumed inflated effect size to the effect size from field experiments in which race was not the primary focus, which presumably is closer to the true value of racial discrimination because non-availability in the non-race-as-the-primary-focus field experiments is not primarily due to the p-value and direction for racial discrimination but is instead or primarily due to the p-value and direction for the other type of discrimination. The publication bias test is whether the effect size for the available non-race-focused discrimination field experiments is smaller than the effect size for the available race-focused discrimination field experiments.
The effect size for racial discrimination from field experiments in which race was not the primary focus might still be inflated in the presence of publication bias because [non-race-as-the-primary-focus field experiments that don't find discrimination in the primary focus but do find discrimination in the race manipulation] are plausibly more likely to be available than [non-race-as-the-primary-focus field experiments that don't find discrimination in the primary focus or in the race manipulation].
But let's stipulate that the racial discrimination effect size from non-race-as-the-primary-focus field experiments should be smaller than the racial discrimination effect size from race-as-the-primary-focus field experiments. If so, how large must this expected difference be such that the observed null result (0.051 coefficient, 0.112 standard error) in the N=21 five-independent-variable regression in Table S7 of Quillian et al. 2017 should be interpreted as evidence of the absence of nontrivial levels of publication bias?
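One rough way to put numbers on that question is the arithmetic sketched below in Python. It takes the reported 0.051 coefficient and 0.112 standard error and computes a 95 percent confidence interval, plus the smallest true difference that a test with that standard error would detect with 80 percent power at a two-sided 0.05 level. The normal approximation, the 80 percent power target, and the function names are my assumptions and do not come from Quillian et al. 2017. With 21 studies and five independent variables, a t reference distribution with few degrees of freedom would widen both numbers.

```python
from statistics import NormalDist


def normal_ci(coef, se, level=0.95):
    """Confidence interval under an assumed normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coef - z * se, coef + z * se


def min_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest true coefficient detected with the given power in a
    two-sided test at level alpha, under a normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * se


# Coefficient and standard error as reported for the N=21 regression in Table S7.
lower, upper = normal_ci(0.051, 0.112)
mde = min_detectable_effect(0.112)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
print(f"minimum detectable effect at 80% power: {mde:.3f}")
```

Under these assumptions, the interval runs from roughly -0.17 to 0.27, and the minimum detectable difference is about 0.31. A publication bias effect smaller than that would often go undetected by this test, so the null result is informative only if the expected difference is at least that large.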
For what it's worth, the publication bias test in the regression below reflects the test used in Quillian et al. 2017, but with a different model and with removal of the three field experiments from 1990 and 1991, such that the sample is the set of Black/White discrimination field experiments from 2001 to 2015. The control for the study method indicates that in-person audits have an estimated 0.40 larger effect size than correspondence audits. The 95 percent confidence interval for the race_not_focus predictor ranges from -0.21 to 0.18. Is that range inconsistent with the expected value based on this test if there were nontrivial amounts of publication bias?
Data available at the webpage for Quillian et al. 2017 [here]
My R code [here]
My Stata code [here]