
The Evolutionary Psychology Blog

By Robert Kurzban

Robert Kurzban is an Associate Professor at the University of Pennsylvania and author of Why Everyone (Else) Is A Hypocrite. Follow him on Twitter: @rkurzban

A Paper You Should Read (p=1.0)

Published 29 September, 2011

Run, don’t walk, and read a new paper by Simmons, Nelson and Simonsohn, to appear in Psychological Science. The title of the paper is “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant” (their italics on “undisclosed”).

The moral of this story is easily enough told. Scientists, they argue, have what they call “researcher degrees of freedom” – all the little decisions you make as you collect and report data, including when to stop running subjects, which data points to exclude, what counts as an “outlier,” and so on. By making just the right (or, perhaps, wrong) decisions, scientists can vastly increase the chance of getting a reportable statistically significant effect (i.e., a p-value less than the mystical level of .05).
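To make the point concrete, here is a small simulation sketch of my own (not from the paper; the two-group design, the t-test, and the 2 SD exclusion rule are just assumptions for illustration). Both groups are drawn from the same population, so there is no true effect; the “researcher” analyzes the data with and without an ad hoc outlier exclusion and reports whichever analysis clears p < .05.

```python
# Sketch: one "researcher degree of freedom" -- flexible outlier exclusion.
# Two groups are drawn from the SAME population (no true effect). We run a
# t-test on the full data and again after dropping points beyond 2 SD of each
# group's mean, then count a "finding" if EITHER analysis reaches p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 10_000, 20
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)   # same population as a

    p_full = stats.ttest_ind(a, b).pvalue

    # Post hoc "outlier" exclusion: drop observations > 2 SD from the group mean
    a_trim = a[np.abs(a - a.mean()) < 2 * a.std()]
    b_trim = b[np.abs(b - b.mean()) < 2 * b.std()]
    p_trim = stats.ttest_ind(a_trim, b_trim).pvalue

    if min(p_full, p_trim) < 0.05:      # report whichever analysis "worked"
        false_positives += 1

print(f"False-positive rate: {false_positives / n_sims:.3f}")  # above the nominal .05
```

Layer on more degrees of freedom – a second dependent measure, an optional covariate, peeking at the data as it comes in – and the rate climbs well past the nominal 5%, which is exactly the paper’s point.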

The paper has the virtue of being not only important but also a little on the whimsical side (starting, indeed, with the note regarding the order of authorship, which they report is “alphabetical (controlling for father’s age [reverse coded]).”). They report a study they conducted in which subjects listened to one of two songs: “When I’m 64” (rendered in the posted manuscript, interestingly, as “When I am 64” – without the contraction – no doubt due to the meddling of an over-zealous APA copy-editor) or “Kalimba,” which they gloss as “an instrumental song that comes free with the Windows 7 operating system.” Who knew? Subjects then indicated their birthdate and their father’s age. Their finding? They report: “People were nearly a year-and-a-half younger after listening to ‘When I am 64’ rather than ‘Kalimba.’” (This effect seems particularly poignant to me today for some reason… maybe I’ll listen to some Beatles…)

How’d they do it? (Or, rather, “How did they do it?” for APA purists.) First, they took multiple dependent measures, and only reported the one that “worked.” (Of course, no one reading this blog would ever do such a thing.) Second, they took advantage of the fact that they could stop gathering data whenever they wanted. For many people involved in social science research, this decision is not made before data collection begins, and can be more or less easily modified. The problem, of course, is that by chance, p will be less than .05 during some part of the data gathering process (especially if you’re only going to use one tail) even if there is no real effect. Stopping when this occurs, then, allows the researcher to use this degree of freedom – when to stop gathering data – to get significant results even when there is nothing to be found. For this study, they checked after every 10 observations to see what they had, and stopped when things looked good.
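The optional-stopping problem is easy to see in a simulation. Below is a rough sketch – again my own illustration, not the authors’ code; the cap of 100 per group and the normal distributions are assumptions I’ve added – of a researcher who peeks at a one-tailed test after every 10 observations per group and stops the moment p dips below .05, even though no real effect exists.

```python
# Sketch: optional stopping. No true effect exists, but we run a one-tailed
# t-test after every 10 observations per group and stop the moment p < .05,
# up to a cap of 100 per group. (The "alternative" argument needs SciPy >= 1.6.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, batch, max_n = 5_000, 10, 100
false_positives = 0

for _ in range(n_sims):
    a = np.empty(0)
    b = np.empty(0)
    while len(a) < max_n:
        a = np.concatenate([a, rng.normal(0, 1, batch)])
        b = np.concatenate([b, rng.normal(0, 1, batch)])
        p = stats.ttest_ind(a, b, alternative="greater").pvalue
        if p < 0.05:                     # "things look good" -- stop and report
            false_positives += 1
            break

print(f"False-positive rate: {false_positives / n_sims:.3f}")  # well above .05
```

Each individual test uses the conventional .05 threshold, but because the researcher gets many chances to stop on a favorable result, the overall false-positive rate ends up far higher.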

In short, the fact that so many decisions are left up to researchers allows for the possibility of many false positives. The authors suggest a number of policy changes for both authors and reviewers to improve things, which I think is a good idea, since the problem might be even worse than it first appears.

If we suppose that it’s true that degrees of freedom allow researchers to “find” significant results, and we also assume that the top journals prefer to publish the surprising stuff, then findings that are striking but ultimately false positives might be finding their way to very visible outlets. If they are false positives, then of course others who try to replicate won’t get the effect. However, these failures to replicate are unlikely to be of interest to editors of these same top journals. In the wake of the publication of Bem’s ESP paper, for instance, other researchers tried to publish failures to replicate in the journal in which it appeared, the Journal of Personality and Social Psychology. New Scientist ran a story about this, which I’m guessing reflects a fairly typical experience. In essence, the journal wasn’t interested because they don’t, as a matter of policy, print replications. Failures to replicate, if they can find a home at all, tend to be published in outlets you’ve never heard of.

This leaves consumers of science with the impression that the published (likely) false positive is the current state of the art. Science is supposed to be self-correcting, but to the extent that surprising but false findings get wide attention while the corrections are buried, the chance that such errors will be corrected is much lower.

The Simmons et al. paper is an important piece, with broad implications for how social science is practiced and consumed. It should be required reading for social scientists of all stripes.

  • Lorenzo

    Funny. I read this entry while procrastinating from writing my master’s degree presentation on a similar issue: stopping rules and the problems of estimating effect size. Guess I can’t escape from this topic!

    Briefly put, when there is a strong publication bias (such as “only publish results that reject the null”) and researchers are not careful with their sampling techniques, the published results can grossly overestimate the true size of the effect – by as much as seven times over! This is usually followed by puzzled researchers who can’t understand why they aren’t able to replicate results with such a large effect size – and who therefore never even submit their manuscripts for publication. No wonder much of the research in the social sciences is in such a Helter Skelter state.

    Oh, and happy birthday!

    • Robert Kurzban

      Agreed, and thanks…

  • http://www.psychologytoday.com/blog/humor-sapiens Gil

    I agree that there is a publication bias toward more sensational, surprising results. I don’t have a problem with that, as long as journals allow replication studies to be published. The problem is also that many researchers do not want to replicate the same studies because it’s not novel enough. Also, I would be more inclined to publish papers with a clear methodology and rationale, and which stem from well-established theories. Papers that just show that two groups are different for whatever reason shouldn’t merit publication.

  • Jesse Marczyk

    A journal that doesn’t print replications (or failures to replicate) as a matter of policy? That hardly seems right. I can understand if most researchers want to do their own work and try and gain status for being considered an insightful genius, but if someone is willing to attempt a replication, I don’t see why they should be shut out from day 1 in terms of publication.


  • http://www.indiana.edu/~kruschke/ John K. Kruschke

    There’s much to be liked in that paper, but two conclusions in particular should be challenged, and are, here: http://doingbayesiandataanalysis.blogspot.com/2011/10/false-conclusions-in-false-positive.html [The challenged conclusions are about p values and Bayesian analysis.]

Copyright 2011 Robert Kurzban, all rights reserved.

Opinions expressed in this blog do not reflect the opinions of the editorial staff of the journal.