NHST and screening tests
When discussing null hypothesis significance testing (NHST), textbooks often present the type I ($\alpha$) and type II ($\beta$) errors as the false positive and false negative rates, respectively, prompting readers to believe they are analogous to the same concepts used in diagnostic screening. However, in the latter these rates are actually interpreted to mean the transposed conditionals.
Suppose you are told that 5% of the population develop a disease and the sensitivity of the available diagnostic tools is 90%. In Ioannidis’ world (Ioannidis, 2005), the probability of getting a positive result when the disease is present, $P(+ | H_1)$, and when it is absent, $P(+ | H_0)$, are $1-\beta$ and $\alpha$, respectively, while $P(H_0 | +)$, defined in (1), is termed the false positive probability or risk and $P(H_1 | +)$ is the positive predictive value. \[ P(H_0 | +) = \frac{P(+ | H_0) P(H_0)}{P(+ | H_0) P(H_0) + P(+ | H_1) P(H_1)} \qquad (1) \]
This way, if $\alpha$ were $1 - \text{specificity} = 0.05$, for instance, then $P(H_0 | +) \approx .5$, which means that we would be wrong about half of the time when diagnosing people. However, we encounter several problems if we apply this reasoning to scientific research. The first concern is the definition of the prior, $P(H_1)$. Ioannidis claims that ‘Before running an experiment, investigators should consider what they believe the chances are that they are testing a true rather than a non-true relationship’ (Ioannidis, 2005, p. 700). But this cannot be accomplished in this framework because $H_1$ is a statement concerning a parameter in a statistical model, not a substantive research question, and conflating the two is a well-known fallacy. Concretely, $P(+ | H_1)$ corresponds to the probability of rejecting $H_0$ when the true parameter value is the one specified by $H_1$, so $P(H_1)$ cannot be the prior probability of a true finding. But let’s ignore this caveat. Should $P(H_1)$ then be the proportion of true hypotheses in a field? How do we define the bounds of a field, and how do we estimate that proportion? Do other characteristics, such as the methodological quality of the research or the author’s reputation, count? Ioannidis does not address any of these questions, so there is no principled way to locate a substantive hypothesis within a reference group. As a consequence, this ambiguity precludes the method from supporting any rigorous claim, because researchers have plenty of room to pick ad hoc criteria that produce a desired result. On the other hand, it also seems silly that the stringency of the results of a thorough experiment should be handicapped by the amount of independent shoddy research conducted in the field, whatever the criterion used to define it.
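To make the arithmetic explicit, here is a minimal sketch of equation (1) with the numbers above (5% prevalence, 90% sensitivity, $\alpha = .05$); the function name is just for illustration and the calculation assumes $H_0$ and $H_1$ exhaust the possibilities.

```python
def false_positive_risk(prior_h1, sensitivity, alpha):
    """P(H0 | +) from equation (1), assuming H0 and H1 are exhaustive."""
    prior_h0 = 1 - prior_h1
    return (alpha * prior_h0) / (alpha * prior_h0 + sensitivity * prior_h1)

# 5% prevalence, 90% sensitivity, alpha = 1 - specificity = 0.05
print(false_positive_risk(prior_h1=0.05, sensitivity=0.90, alpha=0.05))  # ~0.51
```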
Another problem is interpreting the p-value as a dichotomous result. For Ioannidis, it does not matter whether the p-value is .049 or .001: $P(H_0 | +)$ remains the same. Said another way, results further beyond the critical value don’t boost the positive predictive value, $P(H_1 | +)$. In another paper advocating the screening-test approach, Colquhoun (2014) proposed taking $P(+ | H_1)$ to be the power for a given effect size (sensitivity) and $P(+| H_0)$ to be the p-value ($1 - \text{specificity}$). This way, $+$ comes to mean any result at least as incompatible with the null as the observed one. However, this method runs counter to a common-sense notion of evidence. In Figure 1 the false positive risk is plotted as a function of power for a p-value of $.05$ and a prior of $.5$.
To understand the problem, suppose two experiments yield the same p-value but the second has more power for any effect size. Obviously, the first provides more evidence for an effect, because the second was more likely to reject the null and yet produced only the same p-value. Contrary to this reasoning, we see in the figure that the false positive risk decreases as power increases, the opposite of the trend we would expect from a sound method of inference. So when a test is more capable of uncovering an effect, holding everything else constant, failing to achieve a stronger incompatibility with the null may still yield a lower false positive risk, wrongly implying that the evidence for a statistical effect is stronger. Furthermore, notice that either picking the effect size whose power minimizes the false positive risk or establishing an alternative a priori makes no sense, because we would then be entitled to infer an arbitrarily large effect. Therefore, we must use the observed effect size, in which case observed power replaces power and the terms $P(+ | H_0)$ and $P(+ | H_1)$ are identical up to a transformation, as shown in Figure 2.
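Here is a minimal sketch of both trends, using the values quoted for Figure 1 ($\alpha = .05$, $P(H_1) = .5$) and, for the observed-power relationship behind Figure 2, assuming a one-sided z-test (an assumption of this example, not something the figures specify).

```python
import numpy as np
from scipy.stats import norm

# p-less-than rule: P(+ | H0) = alpha, P(+ | H1) = power (values used for Figure 1)
alpha, prior_h1 = 0.05, 0.5
power = np.linspace(0.1, 1.0, 10)
fpr = (alpha * (1 - prior_h1)) / (alpha * (1 - prior_h1) + power * prior_h1)
print(np.round(fpr, 3))  # falls from ~0.33 to ~0.05 as power grows

# Observed power is a deterministic transform of the p-value (one-sided z-test):
# obs_power = Phi(z_obs - z_crit), with z_obs = Phi^{-1}(1 - p)
p = np.array([0.05, 0.01, 0.001])
obs_power = norm.cdf(norm.ppf(1 - p) - norm.ppf(1 - alpha))
print(np.round(obs_power, 3))  # ~0.50, ~0.75, ~0.93: the smaller the p, the larger the observed power
```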
This situation creates a paradox: first, because we don’t actually have a prior for $H_1$ until we observe the data; and second, because the smaller the p-value the larger the observed power, so we wrongly conclude that increasing the specificity also increases the sensitivity. The trade-off between the two is reversed. At this point it’s clear we are no longer in a diagnostic context, and the terms $P(+ | H_0)$ and $P(+ | H_1)$ are ill-defined. By the way, recall that, by the law of total probability, $P(H_0 | +) + P(H_1 | +) = 1$ only if $H_0$ and $H_1$ are exclusive and exhaust all possibilities. When the p-value falls below $\alpha$, the likelihood ratio $\frac{P(+ | H_0)}{P(+ | H_1)}$ just tells us how many times more often we observe $+$ under $H_0$ than under $H_1$, and we then combine this quantity with the prior odds to get $P(H_0 | +)$ and $P(H_1 | +)$.
\[ \begin{aligned} \text{Posterior odds} &= \frac{P(+ | H_0)}{P(+ | H_1)} \frac{P(H_0)}{P(H_1)} \\ P(H_0 | +) &= \frac{\text{Posterior odds}}{1 + \text{Posterior odds}} \end{aligned} \qquad (2) \]
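As a worked instance of (2), take the values behind Figure 1, $\alpha = .05$ and $P(H_0) = P(H_1) = .5$, together with a power of $.8$ (the power value here is an arbitrary choice for the illustration):

\[ \text{Posterior odds} = \frac{0.05}{0.8} \cdot \frac{0.5}{0.5} = 0.0625, \qquad P(H_0 | +) = \frac{0.0625}{1 + 0.0625} \approx 0.06 \]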
Anyway, if you pick $H_1$ after seeing the data, then this framework does not fit your statistical inquiry. Moreover, this argument shows that $P(H_0 | +)$ is not the probability you are wrong when claiming an effect. It is instead the probability you are wrong when claiming an effect in a universe where there are only two possible statistical hypotheses.
Colquhoun later rejected this approach, termed the p-less-than method, and advocated the p-equals method (Colquhoun, 2017). He argued that we should actually care about the exact probability density of the results and interpret $P(+ | H_0)$ and $P(+ | H_1)$ as the density of the data, $+$, under the null and under the alternative, respectively. Following this new approach, Figure 3 again displays the false positive risk against the power attached to an alternative.
The interpretation is exactly as before: we are just counting how many times more often the observed result occurs under the null than under the alternative and weighting this ratio by the prior odds. Nonetheless, in this case the trend falls into the same sort of mistake as before, because for any fixed p-value the false positive risk should be a monotonically increasing function of power. Also, whenever we get a significant result, the effect size that minimizes the false positive risk is the observed one, so we don’t solve any of the previous problems, which are: the absence of criteria to define the reference group a substantive hypothesis belongs to; the trend favouring larger effects for a given p-value (known as the Mountains out of Molehills fallacy (Mayo, 2018, p. 240)); the ad hoc choice of the alternative $H_1$ after seeing the data, which precludes us from calling $P(H_1)$ a prior; the inverse relationship between the observed power and the p-value, which makes the sensitivity and specificity positively related; and the assumption that $H_0$ and $H_1$ exhaust the whole parameter space.
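Below is a minimal sketch of the p-equals calculation, assuming a one-sided z-test and a prior of $.5$ (a simplification for illustration, not a reproduction of the setup in Colquhoun, 2017). It also shows that the alternative minimizing the false positive risk is the one located at the observed statistic.

```python
from scipy.stats import norm

def fpr_p_equals(p, delta, prior_h1=0.5):
    """p-equals false positive risk for a one-sided z-test: densities of the
    observed statistic under H0 (centred at 0) and under H1 (centred at delta)."""
    z_obs = norm.ppf(1 - p)        # statistic implied by the p-value
    f0 = norm.pdf(z_obs)           # density under the null
    f1 = norm.pdf(z_obs - delta)   # density under the alternative
    prior_h0 = 1 - prior_h1
    return (f0 * prior_h0) / (f0 * prior_h0 + f1 * prior_h1)

# For p = .05, sweep alternatives (and hence powers); the minimum false positive
# risk occurs when delta equals the observed z of about 1.645.
for delta in [1.0, 1.645, 2.0, 3.0, 4.0]:
    power = 1 - norm.cdf(norm.ppf(0.95) - delta)
    print(round(power, 2), round(fpr_p_equals(0.05, delta), 3))
```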
And there are at least three more reasons to dismiss this approach. Returning to screening tests, it is important to understand that we don’t know the probability that a single person has the disease. Recall that in this context $P(H_0)$ is a frequency, and saying that the probability a particular person has the disease is $p$ because there are $np$ people with the disease out of a population of $n$ people is like saying that a parameter is within a confidence interval with probability $p$ because that is the frequency with which such intervals contain the parameter. $P(H_0 | +)$ is just an overall probability, a statement about a population. We get the probability of a false positive, not the probability that a particular person has the disease. So when Ioannidis and Colquhoun say that you can compute the probability you are wrong when claiming an effect on the basis of a particular statistically significant result, they actually take what they believe to be the proportion of wrong hypotheses in a reference group to mean the probability of every single hypothesis from that group. The second reason is that the p-value and the power of the test are not conditional probabilities, but this will be explained in a different post. The third one is that we need not compute any false positive risk at all, because we should never claim a discovery based on a single result.
Furthermore, if we do Bayesian inference, all this is akin to suggesting very skeptical priors on the model parameters. This way, the domain expertise necessary for establishing informative priors is sacrificed in favor of an ill-defined sense that most research hypotheses, including ours, are wrong. My advice is that if you are interested in quantifying the probability of a hypothesis being true, go full Bayesian instead of resorting to this kind of frequenstein method.
For another comparison between NHST and screening tests, see Mayo & Morey (2017).
References
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216. https://doi.org/10.1098/rsos.140216
Colquhoun, D. (2017). The reproducibility of research and the misinterpretation of p-values. Royal Society Open Science, 4(12), 171085. https://doi.org/10.1098/rsos.171085
Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8). https://doi.org/10.1371/journal.pmed.0020124
Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press. https://doi.org/10.1017/9781107286184
Mayo, D. G., & Morey, R. D. (2017). A Poor Prognosis for the Diagnostic Screening Critique of Statistical Tests. https://osf.io/nepx9/