Christopher D. DeSante published an article in the American Journal of Political Science titled, "Working Twice as Hard to Get Half as Far: Race, Work Ethic, and America’s Deserving Poor" (57: 342-356, April 2013). The title refers to survey evidence reported by DeSante indicating that, compared to hypothetical white applicants for state assistance, hypothetical black applicants for state assistance received less reward for hard work and more punishment for laziness.

The study had a clever research design: respondents were shown two applications for state assistance, and each applicant was said to need $900, but there was variation in the names of the applicants (Emily, Laurie, Keisha, Latoya, or no name provided) and in the Worker Quality Assessment of the applicant (poor, excellent, or no assessment section provided); respondents were then asked to divide $1500 between the applicants or to use some or all of the $1500 to offset the state budget deficit.

Table 1 below indicates the characteristics of the conditions and the mean allocations made to each alternative. In condition 5, for example, 64 respondents were asked to divide $1500 between hardworking Laurie, lazy Emily, and offsetting the state budget deficit: hardworking Laurie received a mean allocation of $682, lazy Emily received a mean allocation of $566, and the mean allocation to offset the state budget deficit was $250.

DeSanteReproductionTable1blog

---

I'm going to quote DeSante (2013: 343) and intersperse comments about the claims. For the purpose of this analysis, let's presume that respondents interpreted Emily and Laurie as white applicants and Keisha and Latoya as black applicants. Reported p-values for my analysis below are two-tailed p-values. Here's the first part of our DeSante (2013: 343) quote.

Through a nationally representative survey experiment in which respondents were asked to make recommendations regarding who should receive government assistance, I find that American “principles” of individualism, hard work, and equal treatment serve to uniquely benefit whites in two distinct ways. First, the results show that compared to African Americans, whites are not automatically perceived as more deserving of government assistance.

Condition 7 paired Laurie with Keisha, neither of whom had a Worker Quality Assessment. Laurie received a mean allocation of $556, and Keisha received a mean allocation of $600. Keisha received $44 more than Laurie, a difference that is statistically significant at p<0.01. So DeSante is technically correct that "whites are not automatically perceived as more deserving of government assistance," but this claim overlooks evidence from condition 7 that a white applicant was given LESS government assistance than an equivalent black applicant.

Instead of reporting these straightforward results from condition 7, how did DeSante compare allocations to black and white applicants? Below is an image from Table 2 of DeSante (2013), which reported results from eleven t-tests. Tests 3 and 4 provided the evidence for DeSante's claim that, "compared to African Americans, whites are not automatically perceived as more deserving of government assistance."

DeSante2013Table2

Here's what DeSante did in test 3: DeSante took the $556 allocated to Laurie in condition 7 when Laurie was paired with Keisha and compared that to the $546 allocated to Latoya in condition 10 when Latoya was paired with Keisha; that $9 advantage (bear with the rounding error) for Laurie over Latoya (when both applicants were paired with Keisha and neither had a Worker Quality Assessment) did not reach conventional levels of statistical significance.

Here's what DeSante did in test 4: DeSante took the $587 allocated to Emily in condition 4 when Emily was paired with Laurie and compared that to the $600 allocated to Keisha in condition 7 when Keisha was paired with Laurie; that $12 advantage for Keisha over Emily (when both applicants were paired with Laurie and neither had a Worker Quality Assessment) did not reach conventional levels of statistical significance.

So which of these three tests is the best test? My test had more observations, compared within instead of across conditions, and had a lower standard error. But DeSante's tests are not wrong or meaningless: the problem is that tests 3 and 4 provide incomplete information for the purposes of testing for racial bias against applicants with no reported Worker Quality Assessment.
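The difference between comparing within a condition and comparing across conditions can be sketched as below. The allocations are made-up numbers for illustration only, not DeSante's data, and the function names are hypothetical. Within a condition, each respondent makes an allocation to both applicants, so the comparison is a paired test on per-respondent differences; across conditions, the two means come from different respondents, so the comparison is an independent-samples (Welch) test.

```python
import math
import statistics

def paired_t(a, b):
    """t statistic for the mean within-respondent difference a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

def welch_t(a, b):
    """t statistic for a difference in means between two independent groups."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical allocations (dollars) from six respondents who saw both applicants.
keisha = [600, 650, 500, 700, 620, 580]
laurie = [560, 600, 470, 650, 590, 530]
# Hypothetical allocations to a comparison applicant from six OTHER respondents.
other_condition = [560, 600, 470, 650, 590, 530]

t_within = paired_t(keisha, laurie)           # same respondents, paired
t_across = welch_t(keisha, other_condition)   # different respondents, unpaired
```

The mean difference is identical in both comparisons, but in these made-up data the paired standard error is smaller because respondents who give more to one applicant also give more to the other. Whether pairing lowers the standard error in real data depends on how the two allocations covary.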

---

Here's the next part of that quote from DeSante (2013: 343):

Instead, the way hard work and "laziness" are treated is conditioned by race: whites gain more for the same level of effort, and blacks are punished more severely for the same level of "laziness."

Here's what DeSante did to produce this inference. Emily received a mean allocation of $587 in condition 4 when paired with Laurie and neither applicant had a Worker Quality Assessment; but hard-working Emily received $711 in condition 6 when paired with lazy Laurie. This $123 difference can be interpreted as a reward for Emily's hard work, at least in relation to Laurie's laziness.

Now we do the same thing for Keisha paired with Laurie: Keisha received a mean allocation of $600 in condition 7 when paired with Laurie and neither applicant had a Worker Quality Assessment; but hard-working Keisha received $607 in condition 9 when paired with lazy Laurie. This $7 difference can be interpreted as a reward for Keisha's hard work, at least in relation to Laurie's laziness.

Test 7 indicates that the $123 reward to Emily for her hard work was larger than the $7 reward to Keisha for her hard work (p=0.03).

But notice that DeSante could have conducted another set of comparisons:

Laurie received a mean allocation of $556 in condition 7 when paired with Keisha and neither applicant had a Worker Quality Assessment; but hard-working Laurie received $620 in condition 8 when paired with lazy Keisha. This $64 difference can be interpreted as a reward for Laurie's hard work, at least in relation to Keisha's laziness.

Now we do the same thing for Latoya paired with Keisha: Latoya received a mean allocation of $546 in condition 10 when paired with Keisha and neither applicant had a Worker Quality Assessment; but hard-working Latoya received $627 in condition 11 when paired with lazy Keisha. This $81 difference can be interpreted as a reward for Latoya's hard work, at least in relation to Keisha's laziness.

The $16 difference between Laurie's $64 reward for hard work and Latoya's $81 reward for hard work (rounding error, again) is not statistically significant at conventional levels (p=0.76). The combined effect of the DeSante test and my alternate test is not statistically significant at conventional levels (effect of $49, p=0.20), so -- in this dataset -- there is a lack of evidence at a statistically significant level for the claim that "whites gain more for the same level of effort."
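The reward arithmetic above can be retraced from the rounded means quoted in the text. This is only a sketch: the dictionary names are mine, and because the inputs are rounded to the dollar, the results can differ by a dollar from the figures in the text, which presumably came from unrounded means. The p-values cannot be recovered this way because they require the respondent-level data.

```python
# Mean allocations in dollars, rounded, as quoted in the text above.
# Emily and Keisha were each paired with Laurie; Laurie and Latoya were each paired with Keisha.
baseline = {"Emily": 587, "Keisha": 600, "Laurie": 556, "Latoya": 546}     # no assessment
hardworking = {"Emily": 711, "Keisha": 607, "Laurie": 620, "Latoya": 627}  # vs. a lazy partner

# Reward for hard work = hardworking mean minus no-assessment mean.
rewards = {name: hardworking[name] - baseline[name] for name in baseline}

# White-minus-black reward gap in the reported comparison and in the alternate one.
reported_gap = rewards["Emily"] - rewards["Keisha"]
alternate_gap = rewards["Laurie"] - rewards["Latoya"]
```

The two gaps have opposite signs, which is the pattern that the alternate test highlights.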

I conducted a similar set of alternate tests for the inference that "blacks are punished more severely for the same level of 'laziness'"; the effect size was smaller in my test compared to DeSante's test, but evidence for the combined effect was believable: a $74 effect, with p=0.06.

---

Here's the next part of that quote from DeSante (2013: 343):

Second, and consistent with those who take the "principled ideology" approach to the new racism measures, the racial resentment scale is shown to predict a desire for smaller government and less government spending. However, in direct opposition to this ideology-based argument, this effect is conditional upon the race of the persons placing demands on the government: the effect of racial resentment on a desire for a smaller government greatly wanes when the beneficiaries of that government spending are white as opposed to black. This represents strong evidence that racial resentment is more racial animus than ideology.

DeSante based this inference on results reported in Table 3, reproduced below:

DeSante2013Table3

Notice the note at the bottom: "White respondents only." DeSante reported results in Table 3 based on responses only from respondents coded as white, but reported results in Table 2 based on responses from respondents coded as white, black, Asian, Native American, mixed race, or Other. Maybe there's a good theoretical reason for changing the sample. DeSante's data and code are posted here if you are interested in what happens to p-values when Table 2 results are restricted to whites and Table 3 results include all respondents.

But let's focus on the bold RRxWW line in Table 3. RR is racial resentment, and WW is a dichotomous variable for the conditions in which both applicants were white. Model 3 includes categories for WW (two white applicants paired together), BB (two black applicants paired together), and WB (one white applicant paired with one black applicant); this is very important, because these included terms must be interpreted in relation to the omitted category that I will call NN (two unnamed applicants paired together). Therefore, the -337.92 coefficient on the RRxWW variable in model 3 indicates that -- all other model variables held constant -- white respondents allocated $337.92 less to offset the state budget deficit when both applicants were white compared to when both applicants were unnamed.

The -196.43 coefficient for the RRxBB variable in model 3 indicates that -- all other model variables held constant -- white respondents allocated $196.43 less to offset the state budget deficit when both applicants were black compared to when both applicants were unnamed. This -$196.43 coefficient did not reach statistical significance, but the coefficient is important because the bias in favor of the two white applicants relative to the two black applicants is only -$337.92 minus -$196.43; so whites allocated $141.49 less to offset the state budget deficit when both applicants were white compared to when both applicants were black, but the p-value for this difference was 0.41.
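The subtraction behind the $141.49 figure is shown below, using the two model 3 coefficients quoted above; the variable names are mine.

```python
# Model 3 coefficients quoted above; each is relative to the omitted NN category.
rr_x_ww = -337.92  # both applicants white
rr_x_bb = -196.43  # both applicants black

# Because both terms share the same omitted category, their difference
# compares WW conditions directly with BB conditions.
ww_minus_bb = round(rr_x_ww - rr_x_bb, 2)
```

The p-value of 0.41 for this difference cannot be derived from the table alone, because the standard error of a difference between two coefficients also depends on their covariance, which requires re-estimating the model.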

---

Here are a few takeaways from the above analysis:

1. The limited choice of statistical tests reported in DeSante (2013) produced inferences that overestimated the extent of bias against black applicants and missed evidence of bias against white applicants.

2. Takeaway 1 depends on the names reflecting only race of the applicant. But the names might have reflected something other than race; for instance, in condition 10, Keisha received a mean allocation $21 higher than the mean allocation to Latoya (p=0.03): such a difference is not expected if Keisha and Latoya were "all else equal."

3. Takeaway 1 would likely not have been uncovered had the AJPS not required the posting of data and replication files from its published articles.

4. Pre-registration would eliminate suspicion about research design decisions, such as decisions to restrict only some analyses to whites and to report some comparisons but not others.

---

In case you are interested in reproducing the results that I discussed, the data are here, code is here, and the working paper is here. Comments are welcome.

---

UPDATE (Nov 2, 2014)

I recently received a rejection for the manuscript describing the results reported above; the second reviewer suggested portraying the raw data table as a graph: I couldn't figure out an efficient way to do that, but the suggestion did get me to realize a good way to present the main point of the manuscript more clearly with visuals.

The figure below illustrates the pattern of comparison for DeSante 2013 tests 1 and 2: solid lines represent comparisons reported in DeSante 2013 and dashed lines represent unreported equivalent or relevant comparisons; numbers in square brackets respectively indicate the applicant and the condition, so that [1/2] indicates applicant 1 in condition 2.

 

Tests 1 and 2

---

The figure below indicates the pattern of reported and unreported comparisons for black applicants and white applicants with no Worker Quality Assessment: the article reported two small non-statistically significant differences when comparing applicants across conditions, but the article did not report the larger statistically significant difference favoring the black applicant when a black applicant and a white applicant were compared within conditions.

Tests 3 and 4

---

The figure below indicates the pattern of reported and unreported comparisons for the main takeaway of the article. The left side of the figure indicates that one of the black applicants received a lesser reward for an excellent Worker Quality Assessment and received a larger penalty for a poor Worker Quality Assessment, compared to the reward and penalty for the corresponding white applicant; however, neither the lesser reward for an excellent Worker Quality Assessment nor the larger penalty for a poor Worker Quality Assessment was present at a statistically significant level in the comparisons on the right, which were not reported in the article (p=0.76 and 0.31, respectively).

Tests Rest

---

Data for the reproduction are here. Reproduction code is here.

---

UPDATE (Mar 8, 2015)

The above analysis has been published here by Research & Politics.

Rattan et al. (2012) reported evidence, as indicated in the abstract, that:

...simply bringing to mind a Black (vs. White) juvenile offender led participants to view juveniles in general as significantly more similar to adults in their inherent culpability and to express more support for severe sentencing.

Data for the study were collected by the Time Sharing Experiments for the Social Sciences and are located here.*

In this post, I present results of an attempt to reproduce and extend this study.

---

The first takeaway is substantive: the reproduction and extension suggest that Rattan et al. might have applied the incorrect theory to explain results because their reported analyses were limited to white respondents.

Here's part of a figure from Rattan et al. (2012):

RattanL

The figure indicates that white respondents in the study expressed more support for life in prison without parole when primed to think about a black juvenile offender than when primed to think about a white juvenile offender. The authors appear to attribute this racial bias to stereotypic associations:

The results also extend the established literature in social psychology examining the cognitive association between the social category "Black" and criminality, and raise the possibility that this race-crime association may be at odds with lay people’s typical notions about the innocence of juveniles. [citation removed]

But here are the results when observations from both white and black respondents are reported:

Blacks offered more support for life in prison without parole when primed to think of a white juvenile offender than when primed to think of a black juvenile offender. If there is a generalized effect here, it does not appear that the effect is caused by stereotypic associations of criminality with the social category "black." It seems more likely that the racial bias detected in the study reflected ingroup favoritism or outgroup antagonism among both whites and blacks.

Check out the working paper here for more detail on the results, a more nuanced breakdown of white responses, background on related research, and policy implications; feel free to comment on this blog post or to email comments regarding the working paper.

---

The second takeaway is methodological: the reproduction and extension suggest that this study seems to suffer from researcher degrees of freedom.

One of the first things that I noticed when comparing the article to the data was that the article mentioned two dependent variables but there appeared to be four dependent variables in the survey; based on my analyses, the two dependent variables not mentioned in the study did not appear to provide evidence of racial bias. I suppose that I can understand the idea that these null findings reflect "failed" experiments in some way, but I'd have liked as a reader to have been informed that racial bias was detected for only half of the dependent variables.

I also noticed that the dataset had three manipulation check items, but only one of these manipulation checks was used in the analysis; of course, the manipulation check that was used was the most important manipulation check (remembering the race of the juvenile offender), but I'd have liked as a reader to have been informed that manipulation checks for the juvenile offender's age and crime were unused.

And I noticed -- and this is more a problem with SPSS and statistics training than with the Rattan et al. analysis -- that the weighting of observations in SPSS resulted in incorrectly deflated p-values. I discussed this problem here and here and here; data for the first link were the Rattan et al. (2012) data.

---

* There are two datasets for the Rattan et al. (2012) study. I received the full dataset in an email from TESS, and this dataset was previously posted at the TESS archive; the dataset currently posted at the TESS archive contains a weight2 variable for only white respondents who met participation criteria, provided complete data, and finished the survey in one minute or longer.

---

UPDATE (Mar 15, 2015)

Replaced the figure with results for white and black respondents, which should have ranged from 1 to 6. The original figure incorrectly ranged from 0 to 6.

Andrew Gelman linked to a story (see also here) about a Science article by Annie Franco, Neil Malhotra, and Gabor Simonovits on the file drawer problem in the Time Sharing Experiments for the Social Sciences. TESS fields social science survey experiments, and sometimes the results of these experiments are not published.

I have been writing up some of these unpublished results but haven't submitted anything yet. Neil Malhotra was kind enough to indicate that I'm not stepping on their toes, so I'll post what I have so far for comment. From what I have been able to determine, none of the studies discussed below was published, but let me know if I am incorrect about that. I'll try to post a more detailed write-up of these results soon, but in the meantime feel free to contact me for details on the analyses.

I've been concentrating on bias studies, because I figure that it's important to know if there is little-to-no evidence of bias in a large-scale nationally-representative sample; not that such a study proves that there's no bias, but reporting these studies helps to provide a better estimate for the magnitude of bias. It's also important to report evidence of bias in unexpected directions.

 

TESS 241

TESS study 241, based on a proposal from Stephen W. Benard, tested for race and sex bias in worker productivity ratings. Respondents received a vignette about the work behavior of a lawyer whose name was manipulated in the experimental conditions to signal the lawyer's sex and race: Kareem (black male), Brad (white male), Tamika (black female), and Kristen (white female). Respondents were asked how productive the lawyer was, how valuable the lawyer was, how hardworking the lawyer was, how competent the lawyer was, whether the lawyer deserved a raise, how respected the lawyer was, how honorable the lawyer was, how prestigious the lawyer was, how capable the lawyer was, how intelligent the lawyer was, and how knowledgeable the lawyer was.

Substantive responses to these eleven items were used to create a rating scale, with items standardized before summing and cases retained if there were substantive responses for at least three items; this scale had a Cronbach's alpha of 0.92. The scale was standardized so that its mean and standard deviation were respectively 0 and 1; higher values on the scale indicate more favorable evaluations.
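The scale construction described above can be sketched as follows. This is not the code used for the analysis: the data are made up, the function names are hypothetical, and one detail is an assumption on my part, namely that a case's answered items are averaged (rather than summed) so that cases with different numbers of answered items are comparable. `None` stands for a non-substantive response.

```python
import statistics

def zscores(values):
    """Standardize the non-missing values to mean 0 and SD 1; keep None as None."""
    present = [v for v in values if v is not None]
    mean, sd = statistics.mean(present), statistics.stdev(present)
    return [None if v is None else (v - mean) / sd for v in values]

def build_scale(items, min_items=3):
    """items holds one list per survey item; returns one standardized score per case."""
    z_items = [zscores(column) for column in items]
    raw = []
    for case in zip(*z_items):
        answered = [z for z in case if z is not None]
        # Assumption: average the answered items; drop cases with too few answers.
        raw.append(statistics.mean(answered) if len(answered) >= min_items else None)
    return zscores(raw)

def cronbach_alpha(items):
    """Cronbach's alpha from complete cases: k/(k-1) * (1 - sum(item var) / var(total))."""
    complete = [case for case in zip(*items) if None not in case]
    k = len(items)
    item_vars = [statistics.variance(column) for column in zip(*complete)]
    total_var = statistics.variance([sum(case) for case in complete])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Made-up ratings: four items (rows) by six respondents (columns).
items = [
    [5, 4, 2, 1, 3, None],
    [5, 5, 1, 2, 3, None],
    [4, 5, 2, 1, None, 3],
    [5, 4, 1, 2, 3, None],
]
scale = build_scale(items)     # the sixth respondent answered one item, so gets None
alpha = cronbach_alpha(items)
```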

Here is a chart of the main results, with experimental targets on the left side:

benard

The figure indicates point estimates and 95% confidence intervals for the mean level of evaluations in experimental conditions for all respondents and disaggregated groups; data were not weighted because the dataset did not contain a post-stratification weight variable.

The bias in this study is against Brad relative to Kareem, Kristen, and Tamika.

 

TESS 392

TESS study 392, based on a proposal from Lisa Rashotte and Murray Webster, tested for bias based on sex and age. Respondents were randomly assigned to receive a picture and text description of one of four target persons: Diane Williams, a 21-year-old woman; David Williams, a 21-year-old man; Diane Williams, a 45-year-old woman; and David Williams, a 45-year-old man. Respondents were asked to rate the target person on nine traits, drawn from Webster and Driskell (1983): intelligence, ability in situations in general, ability in things that the respondent thinks counts, capability at most tasks, reading ability, abstract abilities, high school grade point average, how well the person probably did on the Federal Aviation Administration exam for a private pilot license, and physical attractiveness. For the tenth item, respondents were shown their ratings for the previous nine items and given an opportunity to change their ratings.

The physical attractiveness item was used as a control variable in the analysis. Substantive responses to the other eight items were used to create a rating scale, with items standardized before summing and cases retained if the case had substantive responses for at least five items; this scale had a Cronbach's alpha of 0.91. The scale was standardized so that its mean and standard deviation were respectively 0 and 1; higher values on the scale indicate more favorable evaluations.

Here is a chart of the main results, with experimental targets on the left side:

rashotte

The figure indicates point estimates and 95% confidence intervals for the mean level of evaluations in experimental conditions for all respondents and disaggregated groups; data were weighted. The bias in this study, among women, is in favor of older persons and, among men, is in favor of the older woman. Here's a table of 95% confidence intervals for mean rating differences for each comparison:

rashottetable

 

TESS 012

TESS study 012, based on a proposal from Emily Shafer, tested for bias for or against married women based on the women's choice of last name after marriage. The study's six conditions manipulated a married woman's last name and the commitment that caused the woman to increase the burden on others. Conditions 1 and 4, 2 and 5, and 3 and 6 respectively reflected the woman keeping her last name, hyphenating her last name, or adopting her husband's last name; the vignette for conditions 1, 2, and 3 indicated that the woman's co-workers were burdened because of the woman's marital commitment, and the vignette for conditions 4, 5, and 6 indicated that the woman's husband was burdened because of the woman's work commitment.

Substantive responses to items 1, 2, 5A, and 6A were used to create an "employee evaluation" scale, with items standardized before summing and cases retained if there were substantive responses for at least three items; this scale had a Cronbach's alpha of 0.73. Substantive responses to items 3, 4, 5B, and 6B were used to create a "wife evaluation" scale, with items standardized before summing and cases retained if there were substantive responses for at least three items; this scale had a Cronbach's alpha of 0.74. Both scales were standardized so that their mean and standard deviation were respectively 0 and 1 and then reversed so that higher scores indicated a more positive evaluation.

Results are presented for the entire sample, for men, for women, for persons who indicated that they were currently married or once married and used traditional last name patterns (traditional respondents), and for persons who indicated that they were currently married or once married but did not use traditional last name patterns (non-traditional respondents); name patterns were considered traditional for female respondents who changed their last name to their spouse's last name (with no last name change by the spouse), and male respondents whose spouse changed their last name (with no respondent last name change).

Here is a chart of the main results, with experimental conditions on the left side:

shafer

The figure displays point estimates and 95% confidence intervals for weighted mean ratings for each condition, adjusted for physical attractiveness. Not much bias detected here, except for men's wife evaluations when the target woman kept her last name.

 

TESS 714

TESS study 714, based on a proposal from Kimberly Rios Morrison, tested whether asking whites to report their race as white had a different effect on multiculturalism attitudes and prejudice than asking whites to report their ethnicity as European American. See here for published research on this topic.

Respondents were randomly assigned to one of three groups: respondents in the European American prime group were asked to identify their race/ethnicity as European American, American Indian or Alaska Native, Asian American or Pacific Islander, Black or African American, Hispanic/Latino, or Other; respondents in the White prime group were asked to identify their race/ethnicity from the same list but with European American replaced with White; and respondents in the control group were not asked to identify their race/ethnicity.

Respondents were shown 15 items regarding ethnic minorities, divided into four sections that we'll call support for multiculturalism, support for pro-ethnic policies, resentment of ethnic minorities, and closeness to whites. Scales were made for items from the first three sections; to create a "closeness to whites" scale, responses to the item on closeness to ethnic minorities were subtracted from responses to the item on closeness to nonminorities, to indicate degree of closeness to whites; this item was then standardized.

Here is a chart of the main results, with experimental conditions on the left side:

rios morrison

The figure displays weighted point estimates and 95% confidence intervals. The prime did not have much influence, except for the bottom right graph.

---

There are a LOT of interesting things in the TESS archives. Comparing reported results to my own analyses of the data (not for the above studies, but for other studies) has illustrated the inferential variation that researcher degrees of freedom can foster.

One of the ways to assess claims of liberal bias in social science is to comb through data such as the TESS archives, which let us see what a sample of researchers are interested in and what a sample of researchers place into their file drawer. Researchers placing null results into a file drawer is ambiguous because we cannot be sure whether placement in the file drawer is due to the null results or to the political valence of the null results; however, researchers placing statistically significant results into a file drawer has much less ambiguity.

---

UPDATE (Sept 6, 2014)

Gábor Simonovits, one of the co-authors of the Science article, quickly and kindly sent me a Stata file of their dataset; those data and personal communication with Stephen W. Benard indicated that results from none of the four studies reported in this post have been published.

I came across an interesting site, Dynamic Ecology, and saw a post on self-archiving of journal articles. The post mentioned SHERPA/RoMEO, which lists archiving policies for many journals. The only journal covered by SHERPA/RoMEO that I have published in that permits self-archiving is PS: Political Science & Politics, so I am linking below to pdfs of PS articles that I have published.

---

This first article attempts to help graduate students who need seminar paper ideas. The article grew out of a graduate seminar in US voting behavior with David C. Barker. I noticed that several articles on the seminar reading list placed in top-tier journals but made an incremental theoretical contribution and used publicly-available data, which was something that I as a graduate student felt that I could realistically aspire to.

For instance, John R. Petrocik in 1996 provided evidence that candidates and parties "owned" certain issues, such as Democrats owning care for the poor and Republicans owning national defense. Danny Hayes extended that idea by using publicly-available ANES data to provide evidence that candidates and parties owned certain traits, such as Democrats being more compassionate and Republicans being more moral.

The original manuscript identified the Hayes article as a travel-type article in which the traveling is done by analogy. The final version of the manuscript lost the Hayes citation but had 19 other ideas for seminar papers. Ideas on the cutting room floor included replication and picking a fight with another researcher.

Of Publishable Quality: Ideas for Political Science Seminar Papers. 2011. PS: Political Science & Politics 44(3): 629-633.

  1. pdf version, copyright held by American Political Science Association

---

This next article grew out of reviews that I conducted for friends, colleagues, and journals. I noticed that I kept making the same or similar comments, so I produced a central repository for generalized forms of these comments in the hope that -- for example -- I do not review any more manuscripts that formally list hypotheses about the control variables.

Rookie Mistakes: Preemptive Comments on Graduate Student Empirical Research Manuscripts. 2013. PS: Political Science & Politics 46(1): 142-146.

  1. pdf version, copyright held by American Political Science Association

---

The next article grew out of friend and colleague Jonathan Reilly's dissertation. Jonathan noticed that studies of support for democracy had treated don't know responses as if the respondents had never been asked the question. So even though 73 percent of respondents in China expressed support for democracy, that figure was reported as 96 percent because don't know responses were removed from the analysis.
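The gap between the two percentages can be illustrated with counts that I made up to match the 73 percent and 96 percent figures; the function name is hypothetical and the counts are not from the actual survey.

```python
def support_rates(support, oppose, dont_know):
    """Return (share of all respondents, share of substantive respondents) who support."""
    everyone = support + oppose + dont_know
    substantive = support + oppose
    return support / everyone, support / substantive

# Hypothetical sample of 1,000 respondents.
rate_all, rate_substantive = support_rates(support=730, oppose=30, dont_know=240)
```

In this illustration, roughly a quarter of the sample gave a don't know response, and dropping those respondents moves the reported level of support from 73 percent to about 96 percent.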

The manuscript initially did not include imputation of preferences for non-substantive responders, but a referee encouraged us to estimate missing preferences. My prior was that multiple imputation was "making stuff up," but research into missing data methods taught me that the alternative -- deletion of cases -- assumed that cases were missing completely at random, which did not appear to be true in our study: the percent of missing cases in a country correlated at -0.30 and -0.43 with the country's Polity IV democratic rating, which meant that respondents were more likely to issue a non-substantive response in countries where political and social liberties are more restricted.

Don’t Know Much about Democracy: Reporting Survey Data with Non-Substantive Responses. 2012. PS: Political Science & Politics 45(3): 462-467. Second author, with Jonathan Reilly.

  1. pdf version, copyright held by American Political Science Association

My previous post discussed p-values in SPSS and Stata for probability-weighted data. This post provides more information on weighting in the base module of SPSS. Data in this post are from Craig and Richeson (2014), downloaded from the TESS archives; SPSS commands are from personal communication with Maureen Craig, who kindly and quickly shared her replication code.

Figure 2 in Craig and Richeson's 2014 Personality and Social Psychology Bulletin article depicts point estimates and standard errors for racial feeling thermometer ratings made by white non-Hispanic respondents. The article text confirms what the figure shows: whites in the racial shift condition (who were exposed to a news article titled, "In a Generation, Racial Minorities May Be the U.S. Majority") rated Blacks/African Americans, Latinos/Hispanics, and Asian-Americans lower on the feeling thermometers at a statistically significant level than whites in the control condition (who were exposed to a news article titled, "U.S. Census Bureau Reports Residents Now Move at a Higher Rate").

CraigRicheson2014PSPB

Craig and Richeson generated a weight variable that retained the original post-stratification weights for non-Hispanic white respondents but changed the weight to 0.001 for respondents who were not non-Hispanic white. Figure 2 results were drawn from the SPSS UNIANOVA command, which "provides regression analysis and analysis of variance for one dependent variable by one or more factors and/or variables," according to the SPSS web entry for the UNIANOVA command.

The SPSS output below represents a weighted analysis in the base SPSS module for the command UNIANOVA therm_bl BY dummyCond WITH cPPAGE cPPEDUCAT cPPGENDER, in which therm_bl, dummyCond, cPPAGE, cPPEDUCAT, and cPPGENDER respectively indicate numeric ratings on a 0-to-100 feeling thermometer scale for blacks, a dummy variable indicating whether the respondent received the control news article or the treatment news article, respondent age, respondent education on a four-level scale, and respondent sex. The 0.027 Sig. value for dummyCond indicates that the mean thermometer rating made by white non-Hispanics in the control condition was different at the 0.027 level of statistical significance from the mean thermometer rating made by white non-Hispanics in the treatment condition.

CR2014PSPB

The image below presents results for the same analysis conducted using probability weights in Stata, with weightCR indicating a weight variable mimicking the post-stratification weight created by Craig and Richeson: the corresponding p-value is 0.182, not 0.027. The difference arises because the Stata p-value reflects a probability-weighted analysis and the SPSS p-value reflects a frequency-weighted analysis.

CR2014bl0
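To make the distinction between the two kinds of weighting concrete, here is a minimal Python sketch for the simplest case, the standard error of a weighted mean. It is not the code that SPSS or Stata runs, and the ratings and weights are made-up numbers, not the Craig and Richeson data. The frequency version reads each weight as a count of identical cases, so the sample size is the sum of the weights; the probability version uses a linearized (sandwich-type) formula in which the sample size is the number of observations, which is my understanding of what a probability-weighted mean reports.

```python
import math

def weighted_mean(values, weights):
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def se_frequency(values, weights):
    """SE of the mean when each weight is read as a count of identical cases."""
    n = sum(weights)  # sample size taken to be the sum of the weights
    m = weighted_mean(values, weights)
    var = sum(w * (x - m) ** 2 for x, w in zip(values, weights)) / (n - 1)
    return math.sqrt(var / n)

def se_probability(values, weights):
    """Linearized SE of the mean; the sample size is the number of observations."""
    n = len(values)
    m = weighted_mean(values, weights)
    score_ss = sum((w * (x - m)) ** 2 for x, w in zip(values, weights))
    return math.sqrt(n / (n - 1) * score_ss) / sum(weights)

# Hypothetical data for illustration only.
ratings = [40, 55, 70, 85, 60, 30]
weights = [0.4, 2.5, 1.0, 3.1, 0.7, 1.3]
unit_weights = [1] * len(ratings)

print("frequency SE:  ", se_frequency(ratings, weights))
print("probability SE:", se_probability(ratings, weights))
print("unit weights:  ", se_frequency(ratings, unit_weights),
      se_probability(ratings, unit_weights))
```

With every weight equal to 1, the two formulas give the same standard error; once the weights vary, the two standard errors diverge, and so do the p-values built on them.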

So why did SPSS return a p-value of 0.027 for dummyCond?

The image below is drawn from online documentation for the SPSS weight command. The second bullet point indicates that SPSS often rounds fractional weights to the nearest integer. The third bullet point indicates that SPSS statistical procedures ignore cases with a weight of zero, so cases with fractional weights that round to zero will be ignored. The first bullet point indicates that SPSS arithmetically replicates a case according to the weight variable: for instance, SPSS treats a case with a weight of 3 as if that case were 3 independent and identical cases.

 weightsSPSS

Let's see if this is what SPSS did. The command gen weightCRround = round(weightCR) in the Stata output below generates a variable with the values of weightCR rounded to the nearest integer. When the Stata command used the frequency weight option with this rounded weight variable, Stata reported p-values identical to the SPSS p-values.

CR2014bl2

The Stata output below illustrates what happened in the above frequency-weighted analysis. The expand weightCRround command replicated each dataset case n-1 times, in which n is the number in the weightCRround variable: for example, each case with a weightCRround value of 3 now appears three times in the dataset. Stata retained one instance of each case with a weightCRround value of zero, but SPSS ignores cases with a weight of zero for weighted analyses; therefore, the regression excluded cases with a zero value for weightCRround.

Stata p-values from a non-weighted regression on this adjusted dataset were identical to SPSS p-values reported using the Craig and Richeson commands.

CR2014bl3
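The round-then-expand steps can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the SPSS or Stata implementation: the ratings and weights are invented, the helper names are hypothetical, and the tie-breaking rule for weights ending in exactly .5 is an assumption.

```python
import math

def round_weight(w):
    """Round a non-negative weight to the nearest integer.
    Ties round up here; that tie rule is an assumption."""
    return math.floor(w + 0.5)

def expand_by_rounded_weight(cases, weights):
    """Repeat each case round_weight(w) times; a rounded weight of 0 drops the case."""
    expanded = []
    for case, w in zip(cases, weights):
        expanded.extend([case] * round_weight(w))
    return expanded

# Hypothetical thermometer ratings and weights for illustration only.
ratings = [40, 55, 70, 85, 60]
weights = [0.001, 0.49, 1.2, 2.6, 0.6]

expanded = expand_by_rounded_weight(ratings, weights)
print(expanded)  # [70, 85, 85, 85, 60]
```

The 0.001 weight and the 0.49 weight both round to zero, so those cases vanish from the analysis, while the case with a weight of 2.6 is counted as three independent and identical cases.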

So how much did SPSS alter the dataset? The output below is for the original dataset: the racial shift and control conditions respectively had 233 and 222 white non-Hispanic respondents with full data on therm_bl, cPPAGE, cPPEDUCAT, and cPPGENDER; the difference in mean therm_bl ratings across conditions was 3.13 units.

CR2014bl4before

The output below is for the dataset after executing the round and expand commands: the racial shift and control conditions respectively had 189 and 192 white non-Hispanic respondents with a non-zero weight and full data on therm_bl, cPPAGE, cPPEDUCAT, and cPPGENDER; the difference in mean therm_bl ratings across conditions was 4.67, a 49 percent increase over the original difference of 3.13 units.

CR2014bl4after
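For readers who want to check the 49 percent figure, the arithmetic is short; the two differences are the values reported above.

```python
original_diff = 3.13   # difference in mean therm_bl ratings, original dataset
adjusted_diff = 4.67   # difference after the round and expand commands

pct_increase = (adjusted_diff - original_diff) / original_diff * 100
print(f"{pct_increase:.0f} percent increase")  # 49 percent increase
```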

---

Certain weighted procedures in the SPSS base module report p-values identical to p-values reported in Stata when weights are rounded, cases are expanded by those weights, and cases with a zero weight are ignored; other weighted procedures in the SPSS base module report p-values identical to p-values reported in Stata when the importance weight option is selected or when the analytic weight option is selected and the sum of the weights is 1.

(Stata's analytic weight option treats each weight as an indication of the number of observations represented in a particular case; for instance, an analytic weight of 4 indicates that the values for the corresponding case reflect the mean values for four observations; see here.)

Test analyses that I conducted produced the following relationship between SPSS output and Stata output.

SPSS weighted base module procedures that reported p-values identical to Stata p-values when weights were rounded, cases were expanded by those weights, and cases with a zero weight were ignored:

  1. UNIANOVA with weights indicated in the WEIGHT BY command

SPSS weighted base module procedures that reported p-values identical to Stata p-values when the importance weight or analytic weight option was selected and the sum of the weights was 1:

  1. Independent samples t-test
  2. Linear regression with weights indicated in the WEIGHT BY command
  3. Linear regression with weights indicated in the REGWT subcommand in the regression menu (weighted least squares analysis)
  4. UNIANOVA with weights indicated in the REGWT subcommand in the regression menu (weighted least squares analysis)

---

SPSS has a procedure that correctly calculates p-values with survey weights, as Jon Peck noted in a comment to the previous post. The next post will describe that procedure.

---

UPDATE (June 20, 2015)

Craig and Richeson have issued a corrigendum to the "On the Precipice of a 'Majority-Minority' America" article that had used incorrect survey weights.

Here are t-scores and p-values from a set of t-tests that I recently conducted in SPSS and in Stata:

Group 1 unweighted
t = 1.082 in SPSS (p = 0.280)
t = 1.082 in Stata (p = 0.280)

Group 2 unweighted
t = 1.266 in SPSS (p = 0.206)
t = 1.266 in Stata (p = 0.206)

Group 1 weighted
t = 1.79 in SPSS (p = 0.075)
t = 1.45 in Stata (p = 0.146)

Group 2 weighted
t = 2.15 in SPSS (p = 0.032)
t = 1.71 in Stata (p = 0.088)

There was no difference between unweighted SPSS p-values and unweighted Stata p-values, but the weighted SPSS p-values fell below conventional thresholds for statistical significance (0.10 for Group 1 and 0.05 for Group 2) and the probability-weighted Stata p-values did not.

John Hendrickx noted some problems with weights in SPSS:

One of the things you can do with Stata that you can't do with SPSS is estimate models for complex surveys. Most SPSS procedures will allow weights, but although these will produce correct estimates, the standard errors will be too small (aweights or iweights versus pweights). SPSS cannot take clustering into account at all.

Re-analysis of Group 1 weighted and Group 2 weighted indicated that t-scores in Stata were the same as t-scores in SPSS when using the analytic weight option [aw=weight] and the importance weight option [iw=weight].

---

SPSS has another issue with weights, indicated on the IBM help site:

If the weighted number of cases exceeds the sample size, tests of significance are inflated; if it is smaller, they are deflated.

This means that, for significance testing, SPSS treats the sample size as the sum of the weights and not as the number of observations: if there are 1,000 observations and the mean weight is 2, SPSS will conduct significance tests as if there were 2,000 observations. Stata with the probability weight option treats the sample size as the number of observations no matter the sum of the weights.

I multiplied the weight variable by 10 in the dataset that I have been working with. For this inflated weight variable, Stata t-scores did not change for the analytic weight option, but Stata t-scores did inflate for the importance weight option.
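Here is a minimal Python sketch of why multiplying the weights by 10 can inflate a test statistic. It is not what SPSS or Stata actually executes, and the values, weights, and function names are hypothetical. The sketch computes a one-sample t-score in which the sample size is the sum of the weights, then repeats the calculation after rescaling the weights to sum to the number of observations, which is my understanding of how analytic weights are normalized.

```python
import math

def freq_weighted_t(values, weights, null_mean=0.0):
    """One-sample t-score that treats the sum of the weights as the sample size."""
    n = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / n
    var = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / (n - 1)
    return (mean - null_mean) / math.sqrt(var / n)

def rescale_to_n(weights):
    """Rescale weights so they sum to the number of observations (assumed normalization)."""
    k = len(weights) / sum(weights)
    return [w * k for w in weights]

# Hypothetical data for illustration only; these weights already sum to 6.
values = [52, 47, 61, 58, 49, 55]
weights = [0.8, 1.3, 0.6, 1.1, 1.4, 0.8]
inflated = [10 * w for w in weights]

t_raw = freq_weighted_t(values, weights, null_mean=50)
t_inflated = freq_weighted_t(values, inflated, null_mean=50)
t_rescaled = freq_weighted_t(values, rescale_to_n(inflated), null_mean=50)

print("original weights:         ", t_raw)
print("weights x 10:             ", t_inflated)
print("weights x 10, rescaled:   ", t_rescaled)
```

The weighted mean is the same in all three runs. Only the implied sample size changes: the t-score grows when the inflated weights are used as-is, and it returns to its original value once the weights are rescaled.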

---

UPDATE (April 21, 2014)

Jon Peck noted in the comments that SPSS has a Complex Samples procedure. SPSS p-values from the Complex Samples procedure matched Stata p-values using probability weights:

SPSS

Stata

The Complex Samples procedure appears to require a plan file. I tried several permutations for the plan, and the procedure worked correctly with this setup:

SPSS-CS

---

UPDATE (May 30, 2015)

More here and here.