Updated July 11, 2016. An earlier version was more apocalyptic.
THIS IS HUGE. A large number of studies from the last 15 years of neuroscience research (via fMRI) could be INVALID!
A recent study in the journal PNAS examined the three software packages most commonly used to analyze fMRI data. Where the researchers expected to find a nominal familywise error rate of 5%, they found error rates of up to 70%.
Here’s what the authors wrote:
“Using mass empirical analyses with task-free fMRI data, we have found that the parametric statistical methods used for group fMRI analysis with the packages SPM, FSL, and AFNI can produce FWE-corrected cluster P values that are erroneous, being spuriously low and inflating statistical significance. This calls into question the validity of countless published fMRI studies based on parametric clusterwise inference. It is important to stress that we have focused on inferences corrected for multiple comparisons in each group analysis, yet some 40% of a sample of 241 recent fMRI papers did not report correcting for multiple comparisons (26), meaning that many group results in the fMRI literature suffer even worse false-positive rates than found here (37).”
In a follow-up blog post, the authors estimated that up to 3,500 scientific studies may be affected, down from their initial published estimate of 40,000. The discrepancy arises because only studies at the edge of statistical reliability are likely to have results that might be affected. For an easy-to-read review of their walk-back, Wired has a nice piece.
The authors also point out that there is more to worry about than those 3,500 studies. An additional 13,000 studies don’t use any statistical correction at all (so they’re not affected by the software glitch reported in the scientific paper). However, these 13,000 studies use an approach that “has familywise error rates well in excess of 50%.” (cited from the blog post)
Here’s what the authors say in their walk-back:
“So, are we saying 3,500 papers are “wrong”? It depends. Our results suggest CDT P=0.01 results have inflated P-values, but each study must be examined… if the effects are really strong, it likely doesn’t matter if the P-values are biased, and the scientific inference will remain unchanged. But if the effects are really weak, then the results might indeed be consistent with noise. And, what about those 13,000 papers with no correction, especially common in the earlier literature? No, they shouldn’t be discarded out of hand either, but a particularly jaded eye is needed for those works, especially when comparing them to new references with improved methodological standards.”
Let’s take a deep breath here. Science works slowly and we need to see what other experts have to say in the coming months.
The authors originally reported that about 40,000 studies published in the last 15 years might be affected. Of these, only some portion of the 3,500 + 13,000 = 16,500 suspect studies may actually have problems. That’s up to 41% of published articles with the potential to have invalid results.
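For readers who want to check the arithmetic, here is a minimal sketch using only the figures quoted above. The variable names are mine, chosen for illustration; they are not the authors’ terms.

```python
# Figures quoted in this post (from the paper and the authors' follow-up blog post).
total_studies = 40_000       # initial published estimate of fMRI studies
cluster_inference = 3_500    # revised estimate of studies that may be affected
no_correction = 13_000       # studies with no multiple-comparisons correction

# The two suspect groups together, and their share of the total.
suspect = cluster_inference + no_correction
share = suspect / total_studies

print(f"{suspect:,} suspect studies out of {total_studies:,} ({share:.0%})")
```

Keep in mind that this is an upper bound on the number of studies that could have problems, not a count of studies known to be wrong.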
But, of course, in the learning field, we don’t care about all these studies as most of them have very little to do with learning or memory. Indeed, a search of the whole history of PsycINFO (a social-science database) finds a total of 22,347 articles mentioning fMRI at all. Searching for articles that have a learning or memory aspect culls this number down to 7,056. This is a very rough accounting, but it does put the overall findings in some perspective.
As the authors warn, it’s not appropriate to dismiss the validity of all the research articles, even if they’re in one of the suspect groups of studies. Instead, each potentially invalid article has to be examined individually to know whether it has problems.
Despite these comforting caveats, the findings by the scientists have implications for many neuroscience research studies over the past 15 years (when the bulk of neuroscience research has been done).
On the other hand, there truly haven’t been many neuroscience findings that have much practical relevance to the learning field as of yet. See my review for a critique of overblown claims about neuroscience and learning. Indeed, as I’ve argued elsewhere, neuroscience’s potential to aid learning professionals probably rests in the future. So, being optimistic, maybe these statistical glitches will end up being a good thing. First, perhaps they’ll prompt greater scrutiny of research methodologies, improving future neuroscience research. Second, perhaps they’ll put the brakes on the myth-creating industrial complex around neuroscience until we have better data to report and utilize.
Still, a dark cloud of low credibility may settle over the whole neuroscience field itself, hampering researchers from getting funding, and making future research results difficult for practitioners to embrace. Time will tell.
Popular Press Articles Citing the Original Article (Published Before the Walk-Backs).
Here are some articles from the scientific press pointing out the potential danger:
From Wikipedia (July 11, 2016): “In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.”
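To get a feel for why the familywise error rate matters, here is a minimal sketch of how quickly it grows as more tests are run. It assumes the tests are statistically independent, which is a simplification of mine: neighboring brain regions in fMRI data are correlated, and the cluster-based methods examined in the PNAS paper are more involved than this. The function names are hypothetical, and the Bonferroni correction shown is a standard textbook adjustment, not the method the paper evaluated.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across independent tests,
    each run at per-test significance level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(target_fwer: float, num_tests: int) -> float:
    """Per-test level that keeps the familywise rate at or below target_fwer."""
    return target_fwer / num_tests


for m in (1, 10, 100):
    uncorrected = familywise_error_rate(0.05, m)
    corrected = familywise_error_rate(bonferroni_threshold(0.05, m), m)
    print(f"{m:>3} tests: uncorrected {uncorrected:.3f}, corrected {corrected:.3f}")
```

Under this simplified assumption, running just 10 uncorrected tests at the 5% level already gives roughly a 40% chance of at least one false positive. That is the kind of inflation a multiple-comparisons correction is meant to prevent.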