Jeremy Freese recently linked to a Jason Mitchell essay that discussed perceived problems with replications. Mitchell discussed many facets of replication, but I will restrict this post to Mitchell's claim that "[r]ecent hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value."

Mitchell's claim appears to be based on a perceived asymmetry between positive and negative findings: "When an experiment succeeds, we can celebrate that the phenomenon survived these all-too-frequent shortcomings. But when an experiment fails, we can only wallow in uncertainty about whether a phenomenon simply does not exist or, rather, whether we were just a bit too human that time around."

Mitchell is correct that a null finding can be caused by experimental error, but Mitchell appears to overlook the fact that positive findings can also be caused by experimental error.

---

Mitchell also appears to confront only the possible "ex post" value of replications, but there is a possible "ex ante" value to replications.

Ward Farnsworth discussed ex post and ex ante thinking using the example of a person who accidentally builds a house that extends onto a neighbor's property: ex post thinking concerns how to best resolve the situation at hand, but ex ante thinking concerns how to make this problem less likely to occur in the future; tearing down the house is a wasteful decision from the perspective of ex post thinking, but it is a good decision from the ex ante perspective because it incentivizes more careful construction in the future.

In a similar way, the threat of replication incentivizes more careful social science. Rational replicators should gravitate toward research for which the evidence appears to be relatively fragile: all else equal, the value of a replication is higher for replicating a study based on 83 undergraduates at one particular college than for replicating a study based on a nationally-representative sample of 1,000 persons; all else equal, a replicator should pass on replicating a stereotype threat study in which the dependent variable is percent correct in favor of replicating a study in which the stereotype effect was detected only using the more unusual measure of percent accuracy, measured as the percent correct of the problems that the respondent attempted.

Mitchell is correct that there is a real possibility that a researcher's positive finding will not be replicated because of error on the part of the replicator, but, as a silver lining, this negative possibility incentivizes researchers concerned about failed replications to produce higher-quality research that reduces the chance that a replicator targets their research in the first place.

Comments to this scatterplot post contained a discussion about when one-tailed statistical significance tests are appropriate. I'd say that one-tailed tests are appropriate only for a certain type of applied research. Let me explain...

Statistical significance tests attempt to assess the probability that we mistake noise for signal. The conventional 0.05 level of statistical significance in social science represents a willingness to mistake noise for signal 5% of the time when there is no signal to detect.

Two-tailed tests presume that these errors can occur because we mistake noise for signal in the positive direction or because we mistake noise for signal in the negative direction: therefore, for two-tailed tests we typically allocate half of the acceptable error to the left tail and half of the acceptable error to the right tail.

One-tailed tests presume either that: (1) we will never mistake noise for signal in one of the directions because it is impossible to have a signal in that direction, so that permits us to place all of the acceptable error in the other direction's tail; or (2) we are interested only in whether there is an effect in a particular direction, so that permits us to place all of the acceptable error in that direction's tail.

Notice that it is easier to mistake noise for signal in a one-tailed test than in a two-tailed test because one-tailed tests have more acceptable error in the tail that we are interested in.
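
To put numbers on this, here is a minimal sketch in Python (standard library only) that compares the cutoff that a test statistic must clear under each approach. It assumes a normally distributed test statistic, and the variable names are mine.

```python
from statistics import NormalDist

ALPHA = 0.05                # conventional acceptable error
std_normal = NormalDist()   # assumption: a z test statistic

# Two-tailed test: half of the acceptable error goes in each tail.
two_tailed_cutoff = std_normal.inv_cdf(1 - ALPHA / 2)

# One-tailed test: all of the acceptable error goes in the predicted tail.
one_tailed_cutoff = std_normal.inv_cdf(1 - ALPHA)

print(f"two-tailed cutoff: {two_tailed_cutoff:.3f}")  # about 1.960
print(f"one-tailed cutoff: {one_tailed_cutoff:.3f}")  # about 1.645
```

The one-tailed cutoff is lower, so a smaller amount of noise in the predicted direction is enough to be declared a signal.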

So let's say that we want to test the hypothesis that X has a particular directional effect on Y. Use of a one-tailed test would mean either that: (1) it is impossible that the true direction is the opposite of the direction predicted by the hypothesis or (2) we don't care whether the true direction is the opposite of the direction predicted by the hypothesis.

I'm not sure that we can ever declare things impossible in social science research, so (1) is not justified. The problem with (2) is that -- for social science conducted to understand the world -- we should always want to differentiate between "no evidence of an effect at a statistically significant level" and "evidence of an effect at a statistically significant level, but in the direction opposite to what we expected."

To illustrate a problem with (2), let's say that we commit before the study to a one-tailed test for whether X has a positive effect on Y, but the results of the study indicate that the effect of X on Y is negative at a statistically significant level, at least if we had used a two-tailed test. Now we are in a bind: if we report only that there is no evidence that X has a positive effect on Y at a statistically significant level, then we have omitted important information about the results; but if we report that the effect of X on Y is negative at a statistically significant level with a two-tailed test, then we have abandoned our original commitment to a one-tailed test in the hypothesized direction.

---

Now, when is a one-tailed test justified? The best justification that I have encountered for a one-tailed test is the scenario in which the same decision will be made if X has no effect on Y and if X has a particular directional effect on Y, such as "we will switch to a new program if the new program is equal to or better than our current program"; but that's for applied science, and not for social science conducted to understand the world: social scientists interested in understanding the world should care whether the new program is merely equal to the current program or is better than the current program.

---

In cases of strong theory or a clear prediction from the literature supporting a directional hypothesis, it might be acceptable -- before the study -- to allocate 1% of the acceptable error to the opposite direction and 4% of the acceptable error to the predicted direction, or some other unequal allocation of acceptable error. That unequal allocation of acceptable error would provide a degree of protection against unexpected effects that is lacking in a one-tailed test.
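
Here is a rough sketch of what that 4%/1% allocation might look like in Python. It assumes a normally distributed test statistic and a hypothesis that predicts a positive effect; the `classify` helper and the constant names are hypothetical, for illustration only.

```python
from statistics import NormalDist

std_normal = NormalDist()   # assumption: a z test statistic

ALPHA_PREDICTED = 0.04      # acceptable error in the predicted (positive) tail
ALPHA_OPPOSITE = 0.01       # acceptable error in the opposite (negative) tail

upper_cutoff = std_normal.inv_cdf(1 - ALPHA_PREDICTED)  # about 1.751
lower_cutoff = std_normal.inv_cdf(ALPHA_OPPOSITE)       # about -2.326


def classify(z):
    """Label a z statistic under the unequal allocation of acceptable error."""
    if z >= upper_cutoff:
        return "significant in the predicted direction"
    if z <= lower_cutoff:
        return "significant in the opposite direction"
    return "not significant"


print(classify(1.8))    # clears the 4% cutoff in the predicted direction
print(classify(-2.5))   # an unexpected effect that the 1% tail still catches
```

The total acceptable error is still 5%, but a sufficiently large effect in the unexpected direction can be reported without abandoning the commitment made before the study.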

Ahlquist, Mayer, and Jackman (2013, p. 3) wrote:

List experiments are a commonly used social scientific tool for measuring the prevalence of illegal or undesirable attributes in a population. In the context of electoral fraud, list experiments have been successfully used in locations as diverse as Lebanon, Russia and Nicaragua. They present our best tool for detecting fraudulent voting in the United States.*

I'm not sure that list experiments are the best tool for detecting fraudulent voting in the United States. But, first, let's introduce the list experiment.

The list experiment goes back at least to Judith Droitcour Miller's 1984 dissertation, but she called the procedure the item count method (see page 188 of this 1991 book). Ahlquist, Mayer, and Jackman (2013) reported results from list experiments that split a sample into two groups: members of the first group received a list of 4 items and were instructed to indicate how many of the 4 items applied to themselves; members of the second group received a list of 5 items -- the same 4 items that the first group received, plus an additional item -- and were instructed to indicate how many of the 5 items applied to themselves. The difference in the mean number of items selected by the groups was then used to estimate the percent of the sample and -- for weighted data -- the percent of the population to which the fifth item applied.
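
The difference-in-means logic can be sketched in a few lines of Python. This is only an unweighted illustration: the item counts below are made up, the function name is mine, and the 1.96 multiplier assumes a normal approximation; it is not the authors' code.

```python
import math
from statistics import mean, variance


def list_experiment_estimate(control_counts, treatment_counts):
    """Estimate the share to which the extra item applies, with a 95% interval."""
    diff = mean(treatment_counts) - mean(control_counts)
    se = math.sqrt(
        variance(control_counts) / len(control_counts)
        + variance(treatment_counts) / len(treatment_counts)
    )
    return diff, (diff - 1.96 * se, diff + 1.96 * se)


# Hypothetical responses: number of items that each respondent said applied.
control = [1, 2, 2, 3, 1, 2]      # saw the 4-item list
treatment = [2, 2, 3, 3, 1, 2]    # saw the same 4 items plus the fifth item

estimate, (low, high) = list_experiment_estimate(control, treatment)
print(f"estimate: {estimate:.2f}, 95% interval: {low:.2f} to {high:.2f}")
```

Note that the estimate can be negative if the 5-item group happens to select fewer items on average than the 4-item group.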

Ahlquist, Mayer, and Jackman (2013) reported four list experiments from September 2013, with these statements as the fifth item:

  • "I cast a ballot under a name that was not my own."
  • "Political candidates or activists offered you money or a gift for your vote."
  • "I read or wrote a text (SMS) message while driving."
  • "I was abducted by extraterrestrials (aliens from another planet)."

Figure 4 of Ahlquist, Mayer, and Jackman (2013) displayed results from three of these list experiments:

[Figure 4 of Ahlquist, Mayer, and Jackman (2013)]

My presumption is that vote buying and voter impersonation are low frequency events in the United States: I'd probably guess somewhere between 0 and 1 percent, and closer to 0 percent than to 1 percent. If that's the case, then a list experiment with 3,000 respondents is not going to detect such low frequency events. 95 percent confidence intervals for weighted estimates in Figure 4 appear to span 20 percentage points or more: the weighted 95 percent confidence interval for vote buying appears to range from -7 percent to 17 percent. Moreover, notice how much estimates varied between the December 2012 and September 2013 waves of the list experiment: the point estimate for voter impersonation in December 2012 was 0 percent, and the point estimate for voter impersonation in September 2013 was -10 percent, a ten-point swing in point estimates.

So, back to the original point, list experiments are not the best tool for detecting vote fraud in the United States because vote fraud in the United States is a low frequency event that list experiments cannot detect without an improbably large sample size: the article indicates that at least 260,000 observations would be necessary to detect a 1% difference.
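
To get a feel for why the required sample is so large, here is a back-of-the-envelope power calculation in Python. It uses the textbook formula for comparing two means and assumes a two-tailed 0.05 test, 80% power, and a standard deviation of 1 item in each group; those inputs are my assumptions, not the article's, so the result is not expected to match the 260,000 figure exactly.

```python
import math
from statistics import NormalDist


def total_n_for_difference(delta, sd, alpha=0.05, power=0.80):
    """Total sample size (two equal groups) to detect a difference in means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-tailed significance cutoff
    z_power = z.inv_cdf(power)           # cutoff for the desired power
    n_per_group = 2 * ((z_alpha + z_power) * sd / delta) ** 2
    return 2 * math.ceil(n_per_group)


# A 1 percentage point difference with an assumed SD of 1 item per group.
needed = total_n_for_difference(delta=0.01, sd=1.0)
print(f"total observations needed: {needed:,}")
```

Under these assumptions the answer is on the order of 300,000 observations, roughly one hundred times the 3,000 respondents in the list experiments.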

If that's the case, then what's the purpose of a list experiment to detect vote fraud with only 3,000 observations? Ahlquist, Mayer, and Jackman (2013, p. 31) wrote that:

From a policy perspective, our findings are broadly consistent with the claims made by opponents of stricter voter ID laws: voter impersonation was not a serious problem in the 2012 election.

The implication appears to be that vote fraud is a serious problem only if the fraud is common. But there are a lot of problems that are serious without being common.

So, if list experiments are not the best tool for detecting vote fraud in the United States, then what is a better way? I think that -- if the goal is detecting the presence of vote fraud and not estimating its prevalence -- then this is one of those instances in which journalism is better than social science.

---

* This post was based on the October 30, 2013, version of the Ahlquist, Mayer, and Jackman manuscript, which was located here. A more recent version is located here and has replaced the "best tool" claim about list experiments:

List experiments are a commonly used social scientific tool for measuring the prevalence of illegal or undesirable attributes in a population. In the context of electoral fraud, list experiments have been successfully used in locations as diverse as Lebanon, Russia, and Nicaragua. They present a powerful but unused tool for detecting fraudulent voting in the United States.

It seems that "unused" is applicable, but I'm not sure that a "powerful" tool for detecting vote fraud in the United States would produce 95 percent confidence intervals that span 20 percentage points.

P.S. The figure posted above has also been modified in the revised manuscript. I have a pdf of the October 30, 2013, version, in case you are interested in verifying the quotes and figure.

I came across an interesting site, Dynamic Ecology, and saw a post on self-archiving of journal articles. The post mentioned SHERPA/RoMEO, which lists archiving policies for many journals. The only journal covered by SHERPA/RoMEO that I have published in that permits self-archiving is PS: Political Science & Politics, so I am linking below to pdfs of PS articles that I have published.

---

This first article attempts to help graduate students who need seminar paper ideas. The article grew out of a graduate seminar in US voting behavior with David C. Barker. I noticed that several articles on the seminar reading list placed in top-tier journals but made an incremental theoretical contribution and used publicly-available data, which was something that I as a graduate student felt that I could realistically aspire to.

For instance, John R. Petrocik in 1996 provided evidence that candidates and parties "owned" certain issues, such as Democrats owning care for the poor and Republicans owning national defense. Danny Hayes extended that idea by using publicly-available ANES data to provide evidence that candidates and parties owned certain traits, such as Democrats being more compassionate and Republicans being more moral.

The original manuscript identified the Hayes article as a travel-type article in which the traveling is done by analogy. The final version of the manuscript lost the Hayes citation but had 19 other ideas for seminar papers. Ideas on the cutting room floor included replication and picking a fight with another researcher.

Of Publishable Quality: Ideas for Political Science Seminar Papers. 2011. PS: Political Science & Politics 44(3): 629-633.

  1. pdf version, copyright held by American Political Science Association

---

This next article grew out of reviews that I conducted for friends, colleagues, and journals. I noticed that I kept making the same or similar comments, so I produced a central repository for generalized forms of these comments in the hope that -- for example -- I do not review any more manuscripts that formally list hypotheses about the control variables.

Rookie Mistakes: Preemptive Comments on Graduate Student Empirical Research Manuscripts. 2013. PS: Political Science & Politics 46(1): 142-146.

  1. pdf version, copyright held by American Political Science Association

---

The next article grew out of friend and colleague Jonathan Reilly's dissertation. Jonathan noticed that studies of support for democracy had treated don't know responses as if the respondents had never been asked the question. So even though 73 percent of respondents in China expressed support for democracy, that figure was reported as 96 percent because don't know responses were removed from the analysis.

The manuscript initially did not include imputation of preferences for non-substantive responders, but a referee encouraged us to estimate missing preferences. My prior was that multiple imputation was "making stuff up," but research into missing data methods taught me that the alternative -- deletion of cases -- assumed that cases were missing at random, which did not appear to be true in our study: the percent of missing cases in a country correlated at -0.30 and -0.43 with the country's Polity IV democratic rating, which meant that respondents were more likely to issue a non-substantive response in countries where political and social liberties are more restricted.

Don’t Know Much about Democracy: Reporting Survey Data with Non-Substantive Responses. 2012. PS: Political Science & Politics 45(3): 462-467. Second author, with Jonathan Reilly.

  1. pdf version, copyright held by American Political Science Association

For those of you coming from the Monkey Cage: welcome!

This is a blog on my research and other topics of interest. I'm in the middle of a series on incorrect survey weighting, which is part of a larger series on reproduction in social science. I'm a proponent of research transparency, such as preregistration of experimental studies to reduce researcher degrees of freedom, third-party data collection to reduce fraud, and public online archiving of data and code to increase the likelihood that error is discovered.

My main research areas right now are race, law, and their intersection. I plan to blog on those and other topics: I am expecting to post on list experiments, abortion attitudes, the file drawer problem, Supreme Court nominations, and curiosities in the archives at the Time-Sharing Experiments for the Social Sciences. I hope that you find something of interest.

---

UPDATE (May 21, 2014)

Links to the Monkey Cage post have been made at SCOTUSBlog, Jonathan Bernstein, and the American Constitution Society.

---

UPDATE (May 21, 2014)

Jonathan Bernstein commented on my Monkey Cage guest post, expressing skepticism about a real distinction between delayed and hastened retirements. The first part of my response was as follows:

Hi Jonathan,

Let me expand on the distinction between delayed and hastened retirements.

Imagine that Clarence Thomas reveals that he wants to retire this summer, but conservatives pressure him to delay his retirement until a Republican is elected president. Compare that to liberals pressuring Ruth Bader Ginsburg to retire before the 2016 election.

Note the distinctions: liberals are trying to change Ginsburg's mind about *whether* to retire, and conservatives are trying to change Thomas's mind about *when* to retire; moreover, conservatives are asking Thomas to sacrifice *extra* *personal* time that he would have had in retirement, and liberals are asking Ginsburg to sacrifice *all* the rest of her years as *one of the most powerful persons in the United States.*

Orin Kerr of the Volokh Conspiracy also commented on the post, at the Monkey Cage itself, asking why a model is necessary when the sample of justices is small enough to ask justices or use past interviews. My response:

Hi Orin,

Artemus Ward has a valuable book, Deciding to Leave, that offers more richness than statistical models offer for investigating the often idiosyncratic reasons for Supreme Court retirements. But for addressing whether justices retire strategically and, if so, when and under what conditions -- or for making quantitative predictions about whether a particular justice might retire at a given time -- there is complementary value in a statistical model.

1. For one thing, there is sometimes reason to be skeptical of the reasons that political actors provide for their behavior: there is a line of research suggesting that personal policy preferences inform Supreme Court justice voting on cases, though many justices might not admit this in direct questioning. Regarding retirements, many justices have been forthcoming about their strategic retirement planning, but some justices have downplayed or denied strategic planning: for example, Ward described press skepticism of Potter Stewart's assertion that he did not strategically delay retirement while Jimmy Carter was president (p. 194).

Statistical models permit us to test theories based on what Stewart and other justices *did* instead of what Stewart and other justices *said*, similar to the way that prosecutors might develop a theory of the crime based on forensic evidence instead of suspect statements.

2. But even if the justices were always honest and public about their reasons for retiring or not retiring, it is still necessary to apply some sort of statistical analysis to address our questions. By my count, from 1962 to 2010, 5 justices retired consistent with a delay strategy and 8 justices retired when the political environment was unfavorable. Observers using simple statistical tools might consider this evidence that justices are more likely to retire unstrategically than to delay retirement, but this overlooks the fact that justices have more opportunities to retire unstrategically than to delay retirement.

For example, assuming that no conservative retires during President Obama's eight years in office, the five conservative justices as a group will each have had eight years to retire unstrategically, for a total of 40 opportunities; but liberal justices have had fewer opportunities to delay retirement: Breyer, Ginsburg, Souter, and Stevens each had one opportunity to retire consistent with a delay strategy in 2009, and -- presuming that justices stay on another year to avoid a double summer vacancy -- Breyer, Ginsburg, Sotomayor, and Stevens each had one opportunity to retire consistent with a delay strategy in 2010, for a total of 8 opportunities.

In this particular period, the proper comparison is not 2 delayed retirements to 0 unstrategic retirements, but instead is 2 delayed retirements out of 8 opportunities (25%) to 0 unstrategic retirements out of 40 opportunities (0%).

3. Sotomayor's addition in the 2010 data highlights another value of statistical models: they permit us to control for other retirement pressures. Statistical models can help account -- in a way that qualitative studies or direct questioning cannot -- for the fact that the 2010 observation of Sotomayor is not equivalent to the 2010 observation of Ginsburg because these justices have different characteristics on other key variables, such as age. From 1962 to 2010, justices retired 14 percent of the time during delayed retirement opportunities, but retired only 4 percent of the time during unfavorable political environments. But these percentages should not be directly compared because there might be spurious correlations that have inflated or deflated the percentages: for example, perhaps older and infirm justices were more likely to experience a delayed opportunity and *that* is why the delayed percentage is relatively higher than the unstrategic percentage. Statistical models let us adjust summary statistics to address such spurious correlations.
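
The opportunity-adjusted comparison from point 2 can be written out as simple arithmetic. This Python sketch uses only the counts given above for the 2009 to 2010 example; the variable names are mine.

```python
# Unstrategic opportunities: 5 conservative justices x 8 years under Obama.
conservative_justices = 5
years_in_office = 8
unstrategic_opportunities = conservative_justices * years_in_office

# Delay opportunities: 4 liberal justices in 2009 plus 4 in 2010.
delay_opportunities = 4 + 4

delayed_retirements = 2
unstrategic_retirements = 0

delayed_rate = delayed_retirements / delay_opportunities
unstrategic_rate = unstrategic_retirements / unstrategic_opportunities

print(f"delayed: {delayed_retirements} of {delay_opportunities} ({delayed_rate:.0%})")
print(f"unstrategic: {unstrategic_retirements} of {unstrategic_opportunities} ({unstrategic_rate:.0%})")
```

Comparing the raw counts (2 versus 0) ignores the denominators; comparing the rates (25% versus 0%) does not, though the rates still do not adjust for variables such as age.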

---

Bill James is said to have said something to the effect that bad statistics are the alternative to good statistics. Relying only on justice statements instead of good statistics can introduce inferential error about justice retirement strategies in the aggregate in several ways: (1) justices might misrepresent their motives for retiring or not retiring; (2) we might not properly account for the fact that justices face more unstrategic opportunities than delayed opportunities or hastened opportunities; and (3) we might not properly account for variables such as age and illness that also influence decisions to retire.

John Sides at the Monkey Cage discusses an article on public broadcasting and political knowledge. The cross-sectional survey data analyzed in the article cannot resolve the question of causal direction, as Sides notes:

Obviously, there are challenges of sorting out correlation and causation here. Do people who consume public broadcasting become more knowledgeable? Or are knowledgeable people just more likely to consume public broadcasting? Via statistical modeling, Soroka and colleagues go some distance in isolating the possible effects of public broadcasting—though they are clear that their modeling is no panacea.
Nevertheless, the results are interesting. In most countries, people who consume more public broadcasting know more about current events than people who consume less of it. But these same differences emerge to a lesser extent among those who consume more or less commercial broadcasting. This suggests that public broadcasting helps citizens learn. Here's a graph:
[Figure from Soroka and colleagues]

But the article should not be interpreted as providing evidence that "public broadcasting helps citizens learn."

Cross-sectional survey data cannot resolve the question of causal direction, but theory can: if we observe a correlation between, say, race and voting for a particular political party, we can rule out the possibility that voting for a particular political party is causing race.

Notice that in the United Kingdom, consumption of commercial broadcasting news correlates with a substantial decrease in political knowledge: therefore, if the figure is interpreted as evidence that broadcasting causes knowledge, then it is necessary to interpret the UK results as commercial broadcasting news in the UK causing people to have less political knowledge. I think that we can safely rule out that possibility.

The results presented in the figure are more likely to reflect self-selection: persons in the UK with more political knowledge choose to watch public broadcasting news, and persons in the UK with less political knowledge choose to watch commercial broadcasting news; that doesn't mean that public broadcasting has zero effect on political knowledge, but it does mean that the evidence presented in the figure does not provide enough information to assess the magnitude of the effect.

From a New York Times article by Harvey Araton:

On a scale of 1 to 10, Andy Pettitte’s level of certitude seemed to be a 5. Halfway convinced he couldn’t grind out another year with the Yankees in New York, he opted for an unforced retirement in Houston to watch his children play sports and begin to figure out what to do with the rest of his life.

Perhaps the use of 1-to-10 scales should be retired, as well, because of the common misconception that 5 is halfway between 1 and 10. If you don't believe me, take a look:
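
A quick arithmetic check, sketched in Python, shows where the midpoint of a 1-to-10 scale actually falls. The variable names are mine.

```python
scale = list(range(1, 11))   # the ten response options: 1, 2, ..., 10

midpoint = (scale[0] + scale[-1]) / 2
options_below_5 = [value for value in scale if value < 5]
options_above_5 = [value for value in scale if value > 5]

print(f"midpoint of the scale: {midpoint}")              # 5.5
print(f"options below 5: {len(options_below_5)}")        # 4
print(f"options above 5: {len(options_above_5)}")        # 5
```

With ten options there is no middle option: 5 sits on the lower half of the scale, and the true midpoint of 5.5 cannot be selected.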

This misconception is not restricted to sportswriters, as I reported in this article describing a review of thousands of interviews that the World Values Survey conducted around the world.

Among the data reported, respondents were asked whether they think that divorce can never be justified (1), can always be justified (10), or something in between. Seventeen percent of the 61,070 respondents for whom a response was available selected 5 on the scale, but only eight percent selected 6 on the scale. The figure below shows that 5 was more popular than 6 even in countries whose populations leaned toward the 10 end of the scale.

It seems, then, that 5 serves as the "psychological mid-point" (see Rose, Munro, & Mishler 2004) of the 1-to-10 scale, which means that some respondents signal their neutrality by selecting a value closer to the left end of the scale. This is not good.

Source: Harvey Araton. 2011. Saying It's Time, but Sounding Less Certain. NY Times.