
If you cast a wide enough net, you’ll find what looks like a prize-winning fish. But you’ll also catch a lot of seaweed, plastic debris, and maybe even a dolphin you didn’t mean to bring in.

Such is the dilemma of interpreting scientific results with statistics. The net, in this analogy, is the statistical concept of a “p-value.” And a growing chorus of experts says that scientific research is using too wide a net — and therefore publishing results that turn out to be false. But is the maligned p-value really to blame?


First, a little (we promise) statistics. At its very simplest, the p-value is what researchers use to assess how likely they would be to get results like the ones seen in their study if the effect being tested (the benefit of a new medication, for example) wasn’t real. (If it seems that we’re having trouble explaining that in an easy-to-digest way, don’t worry, so do many experts.)
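For readers who would rather see that definition run than read it, here is a minimal sketch of one way to estimate a p-value, a permutation test. It is our illustration, not a method from the researchers quoted here. The function name `permutation_p_value` and the measurements are made up. The idea: if the treatment did nothing, the "treated" and "control" labels are arbitrary, so we shuffle them many times and count how often chance alone produces a gap at least as big as the one observed.

```python
import random
from statistics import fmean

def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(fmean(treated) - fmean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Pretend the labels are meaningless: reassign them at random.
        rng.shuffle(pooled)
        gap = abs(fmean(pooled[:n_treated]) - fmean(pooled[n_treated:]))
        if gap >= observed:
            at_least_as_extreme += 1
    # Share of shuffles in which chance alone matched or beat the observed gap.
    return at_least_as_extreme / n_shuffles

# Hypothetical measurements, invented for illustration only.
treated = [5.1, 6.0, 5.8, 6.3, 5.9, 6.1]
control = [5.0, 4.8, 5.2, 4.9, 5.3, 5.1]
p = permutation_p_value(treated, control)
print(f"estimated p-value: {p:.4f}")
```

A small result says only that shuffled labels rarely reproduce the gap. It does not say how probable it is that the treatment works.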

Although the convention is not iron-clad, biomedical scientists generally agree that a p-value below 0.05 counts as statistically significant. And this is important. It’s what separates hyped-up claims based on a single patient’s outcomes, for example, from carefully performed clinical trials on a large group of people. And it’s often the dividing line at which journals will publish a study and reporters will write about it.

But, a pair of researchers argue in a recent issue of Science, the p-value may be doing more harm than good. Statistician Andrew Gelman, of Columbia University, and Eric Loken, a psychologist at the University of Connecticut, say scientists have bought into a “fallacy” — that if a statistically significant result emerges from a “noisy” experiment, a.k.a. one with many variables that are difficult to account for, that result is by definition a sound one.


But, ironically, the very experiments for which we most need statistical help — that is, the ones with many variables interacting in complicated ways — are the ones where p-values are most likely to deceive.

“Statistically speaking, a statistically significant result obtained under highly noisy conditions is more likely to be an overestimate and can even be in the wrong direction,” Gelman told Retraction Watch. “In short: a finding from a low-noise study can be informative, while the finding at the same significance level from a high-noise study is likely to be little more than … noise.”
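Gelman’s point can be checked with a small simulation. What follows is a sketch under our own assumptions, not code from the Science article. We suppose a small true effect of 0.1 (in arbitrary units), simulate many studies whose estimates scatter around it, and keep only those that clear the usual 0.05 bar. The function name `significant_results` and every number are hypothetical.

```python
import random
from statistics import fmean

def significant_results(true_effect, standard_error, n_studies=100_000, seed=1):
    """Simulate studies and summarize only those with p < 0.05 (two-sided)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, standard_error)
        # 1.96 standard errors is the conventional two-sided 0.05 cutoff.
        if abs(estimate) > 1.96 * standard_error:
            kept.append(estimate)
    if not kept:
        raise ValueError("no simulated study reached significance")
    # How many times larger than the truth the surviving estimates look.
    exaggeration = fmean([abs(e) for e in kept]) / true_effect
    # Share of surviving estimates pointing the wrong way (true effect is positive).
    wrong_sign = sum(e < 0 for e in kept) / len(kept)
    return exaggeration, wrong_sign

noisy = significant_results(true_effect=0.1, standard_error=1.0)
quiet = significant_results(true_effect=0.1, standard_error=0.02)
print("high-noise study: exaggeration %.1fx, wrong sign %.0f%%" % (noisy[0], 100 * noisy[1]))
print("low-noise study:  exaggeration %.2fx, wrong sign %.0f%%" % (quiet[0], 100 * quiet[1]))
```

In the noisy setting an estimate can only pass the filter by being far larger than the true effect, so every "significant" finding is an overestimate by construction, and a sizable share point in the wrong direction.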

That kind of approach is how you end up with spurious correlations, such as linking the divorce rate in Maine with the per capita consumption of margarine. (What, that’s true?) Sometimes, this is referred to as p-hacking.
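To see how easily chance produces "findings," consider a sketch in which nothing real is going on at all. We compare pairs of groups drawn from the same noise and count how many comparisons still come out significant. The function name `count_chance_findings` and the settings are our own assumptions, and the test is a simple z-test with a known spread rather than anything a real study would rely on.

```python
import math
import random

def count_chance_findings(n_comparisons=1000, n_per_group=30, alpha=0.05, seed=2):
    """Count 'significant' differences between groups that are pure noise."""
    rng = random.Random(seed)
    # Standard error of a difference in means when both groups have sd = 1.
    standard_error = math.sqrt(2 / n_per_group)
    hits = 0
    for _ in range(n_comparisons):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (sum(group_a) / n_per_group - sum(group_b) / n_per_group) / standard_error
        p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        if p_value < alpha:
            hits += 1
    return hits

hits = count_chance_findings()
print(f"{hits} of 1000 pure-noise comparisons came out 'significant'")
```

With a 0.05 threshold, roughly one comparison in 20 clears the bar by luck alone. Run enough comparisons and report only the winners, and margarine starts to look like a marriage counselor.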

And Gelman and Loken are not alone in worrying about it. Longtime statistician Frank Harrell, chair of biostatistics at Vanderbilt, says “p-values have done significant harm to science.” Yale kidney researcher and epidemiologist F. Perry Wilson has called p-values a “hoax.”

But wait, says Steven McKinney, a statistician with British Columbia Cancer Agency Vancouver. Stop picking on the p-value, and pick on abuses of the p-value instead. “It’s not small p-values that are the problem, it is this repeated phenomenon of researchers publishing a result with a small p-value with no attendant discussion of whether the result is one of any scientific relevance and whether the appropriate amount of data was collected,” McKinney writes in a comment on Retraction Watch. “This is the phenomenon behind the current replication crisis.”

If we may, we think everyone involved has a point.

Fortunately, there’s a path to liberation from p-value tyranny. Hilda Bastian, of the National Library of Medicine, offers a five-step program for avoiding “p-value potholes.” The first, and most obvious, is to recognize that statistical “significance” does not equal importance. “You can have a statistically significant p-value of an utterly trivial difference, say, getting better from a week-long cold 10 minutes faster. You could call that ‘a statistically significant difference,’ but it’s no reason to be impressed,” Bastian writes.
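Bastian’s cold example can be put into rough numbers. The sketch below assumes, purely for illustration, that recovery times vary with a standard deviation of one day (1,440 minutes) and that the true difference between groups is exactly 10 minutes. Neither figure comes from her post, and the function name `two_sided_p_value` is ours. It uses a textbook z-test for a difference in means.

```python
import math

def two_sided_p_value(difference, sd, n_per_group):
    """Two-sided z-test p-value for a difference in means, equal group sizes."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = difference / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

# Same 10-minute difference, same assumed spread, very different sample sizes.
p_small_trial = two_sided_p_value(difference=10, sd=1440, n_per_group=100)
p_huge_trial = two_sided_p_value(difference=10, sd=1440, n_per_group=1_000_000)
print(f"100 people per group:       p = {p_small_trial:.3f}")
print(f"1,000,000 people per group: p = {p_huge_trial:.2g}")
```

The 10 minutes are equally unimpressive in both trials. Only the sample size changed, which is why a small p-value says nothing on its own about whether a difference matters.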

Nor are p-values “truth,” says Bastian. They are evidence, yes, but not dispositive of anything. Conversely, a p-value above 0.05 might mean a lack of evidence, or it might not.

Confused yet? Maybe that’s a good thing. Getting statistics right is difficult — and requires careful thought, not just slapping on a p-value and calling it a day. It wasn’t always this way; p-values are only about 350 years old. They’re not the laws of physics. That doesn’t mean we could or should throw them out — although some have — but it means we can make them work better for us.

  • I have worked on many experiments in which we did not have the problem of too many significant relationships. P values serve a purpose – they provide a standard of evidence. The standard should be combined with other pieces of information to make decisions. However, without p values, we would be making decisions with no standards, which is not a good place to be. P values, in addition, have no alternative. Gelman is a Bayesian. Often, Bayesians use “uninformative priors”, which means that they are doing virtually the same tests as non-Bayesians, just gussying them up with a fancy-schmancy name. In other situations, we need to select variables for decisions. Without p values, this cannot be done. So enough with the silly comments about p values. They have a place in scientific discussions.

    • While this is ‘old news’ scientifically, every year thousands of new scientists enter their studies or the job market and many will find it ‘new news’; not to mention the senior scientists who were not aware of their p-hacking approaches. Good to continue to push the subject and make lots of noise about the issue IMO.

  • No serious practitioner of commercial applied stats wastes time with “p-values.” Where money is on the line, we use stress-tested “expected value” calculations (probability estimates x payoff/payout).
