# Penalizing P Values

20 Nov 2013

Ioannidis' paper suggesting that most published results in medical research are not true is now high profile enough that even my dad, an artist who wouldn't know a test statistic if it hit him in the face, knows about it. It has even shown up recently in the Economist as a cover article and plays directly into the “decline effect” discussed in a cover story in the New Yorker from 2010. Something is seriously wrong with science if only a small fraction of papers can actually be replicated.

But, placed in the context of the “decline effect,” this result makes sense. It reflects a fundamental aspect of, and a potential flaw in, the way frequentist inference treats hypotheses.

Using a wild example, suppose I return from a walk in the woods and report an encounter with bigfoot. Now, while it is possible that bigfoot is real, it seems unlikely. But I have some blurry pictures and video of something moving off in the bush. I claim that this is evidence that bigfoot is real.

I show you my evidence of bigfoot and tell you about my encounter. You know me and know that I am fairly sane and always wear my glasses. You think it is unlikely that I would make this up or mistake a deer for bigfoot. You think there is less than a 5% chance that I would come up with something this convincing, or more so, given that bigfoot is not real. Therefore, the evidence would suggest that you reject the null that bigfoot is not real.

Hopefully, you don't think that is reasonable. But that is exactly how frequentist inference treats evidence for or against the null. The p value is simply \( P(\theta \geq \hat{\theta} | H_0) \), which takes no account of how plausible the claim was to begin with. The claim that bigfoot is real is given as much credibility as the claim that smoking causes cancer (RA Fisher might think that is reasonable but the rest of us have reason for concern). We would probably conclude that it was much more likely that I saw a deer or a hoax than that I saw an actual bigfoot.

This becomes a problem for a few reasons:

- We notice things more when they are unexpected
- We report things more when they are unexpected
- Many things that are unexpected are unexpected for a reason

This problem is especially serious when people “throw statistics” at data with the goal of making causal inference without using a priori theory as a guide. They find that there is a relationship between X and Y that is significant at \( \alpha = 0.05 \) and publish.

The field of Bayesian statistics provides a radically different form of inference that can potentially be used to address this question, but a simple back-of-the-envelope penalty term may work just as well. Consider the simple case of Bayes' theorem,

\[ P(A | B) = \frac{P(B | A) P(A)}{P(B | A) P(A) + P(B | A^c) P(A^c)} \]

Here \( A \) is the event that the null hypothesis is true and \( B \) is the event that we observe a significant result. Take \( P(B | A) \) to be the same as the p value and \( P(A) \) to be our a priori estimate of how likely the null hypothesis is to be true. What is \( P(B | A^c) \), the probability of rejecting the null when the null is not true? That is simply the power of the test with the given parameters. Suppose we set \( P(B | A) \) to some constant value (e.g., 0.05) and label anything with \( p \) less than that value as significant and anything greater as non-significant, i.e., \( P(B | A) = \alpha \). We can then calculate the rate of “false positive” results for that value of \( \alpha \) and power with

\[ P(H_0 | \hat{\theta}) = \frac{P(\theta \geq \hat{\theta} | H_0) P(H_0)}{P(\theta \geq \hat{\theta} | H_0) P(H_0) + P(\theta \geq \hat{\theta} | H_0^c) (1 - P(H_0))} \]
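To make the formula concrete, here is a minimal Python sketch of the same calculation. It is not the code behind the Shiny graph; the function name `false_positive_probability` and its parameter names are my own labels for \( \alpha \), the power, and \( P(H_0) \).

```python
def false_positive_probability(alpha, power, prior_null):
    """Probability that the null is true given a significant result.

    alpha      -- significance threshold, standing in for P(B | A)
    power      -- probability of rejecting the null when it is false, P(B | A^c)
    prior_null -- a priori probability that the null is true, P(A)

    All names here are illustrative assumptions, not taken from the post.
    """
    for name, value in (("alpha", alpha), ("power", power), ("prior_null", prior_null)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")

    # Numerator of Bayes' theorem: significant result AND null is true.
    false_positives = alpha * prior_null
    # Second term of the denominator: significant result AND null is false.
    true_positives = power * (1.0 - prior_null)

    total = false_positives + true_positives
    if total == 0.0:
        raise ValueError("a significant result has zero probability under these inputs")
    return false_positives / total


if __name__ == "__main__":
    scenarios = [
        ("equal prior, power 0.80", 0.05, 0.80, 0.5),
        ("equal prior, power 0.35", 0.05, 0.35, 0.5),
        ("90% null, power 0.80", 0.05, 0.80, 0.9),
        ("90% null, power 0.35", 0.05, 0.35, 0.9),
    ]
    for label, alpha, power, prior_null in scenarios:
        prob = false_positive_probability(alpha, power, prior_null)
        print(f"{label}: {prob:.2%}")
```

With these four illustrative scenarios the values come out to roughly 6%, 12.5%, 36%, and 56%.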

I wanted to get a feel for what this would look like and how these different parameters would interact. Also I needed an excuse to learn Shiny. You can see how this comes together and play with the values in the dynamic graph below.

I would encourage you to play around with it and see how the different values affect the probability that the alternative is true. You can see that in the default case, where we place equal weight on the null being true or false and have well-powered studies, we do pretty well for ourselves. But as soon as you lower the power to a plausible 0.35, the probability of the results being spurious roughly doubles. If you set the power back at 0.80 but set the probability of the null being true at 90%, as Ioannidis suggests, we see the probability of a false positive at \( \alpha = 0.05 \) is now roughly 35%! If you combine the low power and unlikeliness of the tested claims, the probability of false conclusions is well over 50% using a standard \( \alpha \).
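For anyone who wants to check the last two figures by hand, plugging the numbers into Bayes' theorem gives the following. I write "sig" as shorthand for a significant result, and use \( \alpha = 0.05 \) and \( P(H_0) = 0.9 \). With power at 0.80:

\[ P(H_0 | \text{sig}) = \frac{0.05 \times 0.9}{0.05 \times 0.9 + 0.80 \times 0.1} = \frac{0.045}{0.125} = 0.36 \]

and with power lowered to 0.35:

\[ P(H_0 | \text{sig}) = \frac{0.05 \times 0.9}{0.05 \times 0.9 + 0.35 \times 0.1} = \frac{0.045}{0.080} \approx 0.56 \]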

As exciting as it would be to be known as the guy who found bigfoot, odds are that it was just some high schoolers out to play games with you. The null should be treated differently when we are reporting seemingly non-obvious and unexpected results. Even a simple sanity test as described here may reduce the surprisingly and unsustainably large number of findings that are later falsified or fail to reproduce. It certainly explains one process by which they may occur.