A Paper You Should Read (p=1.0)
Published 29 September, 2011
Run, don’t walk, and read a new paper by Simmons, Nelson, and Simonsohn, to appear in Psychological Science. The title of the paper is “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant” (their italics on “undisclosed”).
The moral of this story is easily told. Scientists, they argue, have what they call “researcher degrees of freedom”: all the little decisions you make as you collect and report data, including when to stop running subjects, which data points to exclude, what counts as an “outlier,” and so on. By making just the right (or, perhaps, wrong) decisions, scientists can vastly increase the chance of getting a reportable, statistically significant effect (i.e., a p-value less than the mystical level of .05).
The paper has the virtue of being not only important but also a little on the whimsical side (starting, indeed, with the note regarding the order of authorship, which they report is “alphabetical (controlling for father’s age [reverse coded]).”) They report a study that they conducted in which their subjects listened to one of two songs: “When I’m 64” (rendered in the posted manuscript, interestingly, as “When I am 64” – without the contraction – no doubt due to the meddling of an over-zealous APA copy-editor) or “Kalimba,” which they gloss as “an instrumental song that comes free with the Windows 7 operating system.” (Who knew?) Subjects then indicated their birthdate and their father’s age. Their finding? They report: “People were nearly a year-and-a-half younger after listening to ‘When I am 64’ rather than ‘Kalimba.’” (This effect seems particularly poignant to me today for some reason… maybe I’ll listen to some Beatles…)
How’d they do it? (Or, rather, “How did they do it?” for APA purists.) First, they took multiple dependent measures, and only reported the one that “worked.” (Of course, no one reading this blog would ever do such a thing.) Second, they took advantage of the fact that they could stop gathering data whenever they wanted. For many people involved in social science research, this decision is not made before data collection begins, and can be more or less easily modified. The problem, of course, is that by chance, p will dip below .05 at some point during the data gathering process (especially if you’re only going to use one tail) even if there is no real effect. Stopping when this occurs, then, allows the researcher to use this degree of freedom – when to stop gathering data – to get significant results even when there is nothing to be found. For this study, they checked after every 10 observations to see what they had, and stopped when things looked good.
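You can see the optional-stopping problem in a few lines of simulation. Here is a minimal sketch (not the paper’s exact procedure – the batch size of 10 and the maximum sample of 100 are illustrative choices of mine): the data are pure noise, yet testing after every 10 observations and stopping at the first p &lt; .05 yields far more “significant” results than the nominal 5%.

```python
# Simulate the "when to stop gathering data" degree of freedom.
# Data are draws from N(0, 1), so there is no real effect; any
# significant result is a false positive. Assumptions (mine, for
# illustration): a two-sided z-test with known unit variance,
# peeking after every 10 observations, a cap of 100 observations.
import math
import random
from statistics import NormalDist

random.seed(1)

def p_value(sample):
    """Two-sided z-test of mean zero, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def run_study(batch=10, max_n=100, peek=True):
    """Return True if the study ends 'significant' (p < .05)."""
    sample = []
    while len(sample) < max_n:
        sample.extend(random.gauss(0, 1) for _ in range(batch))
        if peek and p_value(sample) < 0.05:
            return True  # stop early and report the effect
    return p_value(sample) < 0.05

trials = 2000
honest = sum(run_study(peek=False) for _ in range(trials)) / trials
peeking = sum(run_study(peek=True) for _ in range(trials)) / trials
print(f"false-positive rate, single test at n=100: {honest:.3f}")
print(f"false-positive rate, peeking every 10:     {peeking:.3f}")
```

The honest procedure hovers around the advertised .05; the peeking procedure lands well above it, because it only needs p to dip below .05 once across the repeated looks.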
In short, the fact that so many decisions are left up to researchers allows for the possibility of many false positives. The authors suggest a number of policy changes for both authors and reviewers to improve things, which I think is a good idea, since the problem might be even worse than it first appears.
If we suppose that it’s true that degrees of freedom allow researchers to “find” significant results, and we also assume that the top journals prefer to publish the surprising stuff, then findings that are striking but ultimately false positives might be finding their way to very visible outlets. If they are false positives, then of course others who try to replicate won’t get the effect. However, these failures to replicate are unlikely to be of interest to editors of these same top journals. In the wake of the publication of Bem’s ESP paper, for instance, other researchers tried to publish failures to replicate in the journal in which it appeared, the Journal of Personality and Social Psychology. New Scientist ran a story about this, which I’m guessing reflects a fairly typical experience. In essence, the journal wasn’t interested because they don’t, as a matter of policy, print replications. Failures to replicate, if they can find a home at all, tend to be published in outlets you’ve never heard of.
This leaves consumers of science with the impression that the published (likely) false positive is the current state of the art. Science is supposed to be self-correcting. To the extent that surprising but false findings get wide attention, but the corrections are buried, the chance that such errors will be corrected is much lower.
The Simmons et al. paper is an important piece, with broad implications for how social science is practiced and consumed. It should be required reading for social scientists of all stripes.