Too often, researchers behave like ardent slot-machine players, awaiting each result to learn only whether they have won or lost. Among researchers, this mindset shows itself as a fixation on whether a finding is “statistically significant.”
When they triumph over the one-armed bandit, slot-machine players walk away with a financial reward. Those who yearn for “statistically significant” results, however, may walk away from a study having found “significance” without having obtained anything of value. “Statistically significant” findings may offer no insight and add little to our knowledge, because the questions we study are more complex than the mere presence or absence of an effect.
In the following reprint from the British Medical Journal, Sterne and Smith remind us that those who worship at the shrine of the P value may be paying homage to false gods. Research, particularly applied research, should yield not only findings that are probably real (the most that a P value can tell us) but findings that are meaningful. There are also implications for related practices such as power analysis. We hear researchers talk about using power analyses to determine the sample size needed to show a “significant effect,” when they should instead be discussing the sample size needed to detect not merely a statistically significant finding but a meaningful effect, in other words, a result that might matter!
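To see why the meaningful effect, not mere significance, should drive study design, consider the standard sample-size formula for comparing two group means (a textbook result, not one drawn from the reprinted article). With common standard deviation $\sigma$, two-sided significance level $\alpha$, and power $1-\beta$, the sample size required per group to detect a difference of at least $\delta$ is

$$ n \;=\; \frac{2\,\sigma^{2}\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{\delta^{2}}. $$

The smallest difference worth detecting, $\delta$, appears squared in the denominator: halving it quadruples the required sample. Choosing $\delta$ is a clinical judgment about what matters, and no significance threshold can make that judgment for us.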
The P value is emblematic of what is wrong with some forms of research: research in which relevance and utility seem less important than a “positive finding.” For years, Physical Therapy, like many other journals, has essentially banned P values from abstracts. When P values are reported outside the context of the study question, the sample size, and the statistic, they obscure rather than illuminate. In physical therapy, the yes-no mentality of the P value has even propagated to reliability studies. Measurements should never be called “reliable” or “not reliable.” Measurements have errors that may or may not fall within a window of acceptability for a given use, a fact that is lost when we approach them with the mindset of significance testing.
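As an illustration of treating measurement error as a quantity rather than a verdict (again a standard reliability formula, not one taken from this editorial or the reprinted article), the standard error of measurement expresses reliability in the units of the measurement itself:

$$ \mathrm{SEM} \;=\; \mathrm{SD}\,\sqrt{1 - \mathrm{ICC}}, $$

where $\mathrm{SD}$ is the between-subjects standard deviation and $\mathrm{ICC}$ is the intraclass correlation coefficient. A measurement with an error of a few units may be acceptable for one clinical purpose and useless for another; that judgment depends on how the measurement will be used, not on whether a coefficient crosses a significance threshold.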
The Journal believes that the following article should be mandatory reading for all new researchers and that it should be one of those articles that veterans reread periodically. We thank our colleagues at the British Medical Journal for having published it and for allowing us to reprint it.
- Physical Therapy