Review

Patterns (N Y). 2023 Dec 8;4(12):100878. doi: 10.1016/j.patter.2023.100878.

The roles, challenges, and merits of the p value

Oliver Y Chén et al.

Abstract

Since the 18th century, the p value has been an important part of hypothesis-based scientific investigation. As statistical and data science engines accelerate, questions emerge: to what extent are scientific discoveries based on p values reliable and reproducible? Should one adjust the significance level or find alternatives for the p value? Inspired by these questions and everlasting attempts to address them, here, we provide a systematic examination of the p value from its roles and merits to its misuses and misinterpretations. For the latter, we summarize modest recommendations to handle them. In parallel, we present the Bayesian alternatives for seeking evidence and discuss the pooling of p values from multiple studies and datasets. Overall, we argue that the p value and hypothesis testing form a useful probabilistic decision-making mechanism, facilitating causal inference, feature selection, and predictive modeling, but that the interpretation of the p value must be contextual, considering the scientific question, experimental design, and statistical principles.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Recent trends of the p value, p hacking, and Bayesian evidence in scientific studies (A) The growth pattern of the p value in the past decade. Recent years have witnessed a considerable increase in articles on topics related to the p value, p hacking, and Bayesian evidence. In particular, articles that discuss both the p value and p hacking, as well as those that discuss both the p value and Bayesian evidence, have grown along an exponential-like trajectory. We used the following literature search strategy. We first defined three sets of keywords: PV (p value), PH (p hacking), and BE (Bayesian evidence). We then defined the publications in [1]–[4] as PV ∩ PH ∖ BE, PV ∩ BE ∖ PH, PH ∩ BE ∖ PV, and PV ∩ PH ∩ BE, respectively. The search used the advanced search function provided by Google Scholar, where, for example, the query PV ∩ PH ∖ BE is equivalent to entering “p value; p hacking -Bayesian evidence.” (B) The distribution of p values across academic disciplines. p values are widely used in 14 common subjects, most noticeably in biological sciences, medical and health sciences, multidisciplinary fields, and psychological and cognitive sciences. Across subjects, smaller p values (those between 0 and 0.025) appear to be reported more commonly than larger (albeit still significant at 0.05) counterparts. Data for plotting (B) are from Head et al.
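The four publication groups in (A) are plain set algebra on the three keyword hits. As a minimal sketch (the article IDs below are hypothetical, not real search results), Python's built-in set operators express the same queries:

```python
# Hypothetical article IDs standing in for Google Scholar hits
PV = {"a1", "a2", "a3", "a4"}   # articles mentioning "p value"
PH = {"a2", "a3", "a5"}         # articles mentioning "p hacking"
BE = {"a3", "a4", "a6"}         # articles mentioning "Bayesian evidence"

group1 = (PV & PH) - BE   # p value and p hacking, without Bayesian evidence
group2 = (PV & BE) - PH   # p value and Bayesian evidence, without p hacking
group3 = (PH & BE) - PV   # p hacking and Bayesian evidence, without p value
group4 = PV & PH & BE     # all three topics

print(group1, group2, group3, group4)
```

The four groups are disjoint by construction, which is what makes the counts in (A) comparable across curves.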
Figure 2
The fundamental goal of hypothesis testing in science (A) The triad of the data-generating mechanism, the observed data, and uncovering the true mechanism via hypothesis testing. One chief goal of scientific investigation is to understand the underlying (biological or physical) mechanism that gives rise to the observed data. When one has no, or only preliminary, knowledge about the mechanism M and its parameters θ, one hopes to learn about them using observed data x, which are generated via x = M(θ) + ε, where ε denotes noise and measurement error. To do so, one proposes a model M′ with parameters θ′ (which, one hopes, approximate M and θ) and, given data x, obtains the estimated parameter θ̂. One then performs hypothesis tests to examine how close the estimated parameter may be to θ. (B) The true but unclear data-generating mechanism. In CyTOF (cytometry by time of flight) mass cytometry, rare earth metal isotopes are coupled to antibodies via a chelator tag and detected by a mass cytometer to quantitatively assess the concentration of antibody-specific antigen present on a given cell. From left to right: cells are first incubated with a cocktail of metal isotope-labeled antibodies, washed to remove unbound antibodies, and then sprayed into droplets using a nebulizer. The droplets are dried in the heated spray chamber, allowing antibody-bound cells to individually enter the inductively coupled plasma (ICP) flame, resulting in instantaneous atomization of the cell into an ion cloud with its corresponding elemental composition. Elements found in normal biological samples with a mass of less than 80 atomic mass units (AMU) are filtered out in the quadrupole, and the remaining rare earth metals coupled to specific antibodies are measured using a time-of-flight analyzer. (C) The true model and the observed data. Left: a schematic representation of the CyTOF model. Two new types of cells are marked by 50 different biomarkers. There exists a true data-generating mechanism M driven by some parameter θ = (θ_1, θ_2, ..., θ_50), where θ_i determines whether the ith biomarker tags one cell type, both cell types, or neither (see text for details). We know neither M nor θ. Right: starting from the data, one proposes a statistical model to discover, via a hypothesis test, significant biomarkers that can distinguish the two cell types. Parts of (B) and (C) were drawn using BioRender.
Figure 3
The p value and related concepts (A) Calculating the p value (see text for details). (B) Significance level (type I error), type II error, and power. The significance level (type I error, or α) is a predetermined value (say, 0.05) that quantifies the probability of observing extreme values given that the null hypothesis is true (red shading). The type II error (or β) quantifies the probability of failing to reject the null hypothesis given that the alternative hypothesis is true (blue shading). The power (or 1 − β) quantifies the probability of rejecting the null hypothesis given that the alternative hypothesis is true (dashed shading). The value 1 − α quantifies the probability of failing to reject the null hypothesis when it is true (represented by the white area, not completely shown, under the null hypothesis curve). (C) The frequentist perspective of the p value. In the frequentist view of hypothesis testing, the parameter is considered an unknown constant rather than a random variable. (D) The Bayesian perspective of evidence seeking. Suppose the prior knowledge weakly supports the null hypothesis H0: μ ≤ μ0 (with a mean μ_π that sits slightly left of μ0), and the likelihood function has a center x̄_n that is far right of μ0. Then the posterior mean μ_n is pulled, after seeing the data, rightward away from μ_π, toward μ0 and beyond; the farther the center of the likelihood function is from μ0 (namely, the more evidence the data provide against the null), the farther the posterior mean μ_n is pulled rightward away from μ0, and the stronger, therefore, the a posteriori evidence supporting the alternative hypothesis. (E) The three-world system (the physical world, the Platonic mathematical world, and the mental world) and our modification of it. The physical world represents the entire universe (from every chemical element to every individual) and contains properties that are not readily accessible to the observer. Some of these properties are governed by and/or can be explained using mathematical principles. The mathematical principles translate into (mental) understanding and form one's perspective about the physical world. (F) The role of the p value in making scientific enquiries. Consider an example in which a clinician made inquiries into the prevalence of a disease in a specific age group (i.e., a specific population). Suppose the clinician considered a null hypothesis under which the prevalence was 10% (in the population). Because measuring the prevalence of a disease in the whole population was impractical, the clinician selected a random sample of 10 individuals from that age group (left arrow) and found that two had the disease (top circle). The clinician then conducted a hypothesis test that generated a p value of 0.26 (right arrow) and used this to make inferences about the population (bottom arrow). Given the p value, the clinician concluded that there was not enough evidence (at a significance level of 0.05) from the sample to reject the null hypothesis (made about the population).
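The clinician's p value in (F) can be reproduced with an exact two-sided binomial test: 2 successes in 10 trials under a null prevalence of 10%. The sketch below uses only the standard library; the function names are ours, and the two-sided convention (summing all outcomes no more likely than the observed one) matches the common exact-test definition:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_test_two_sided(k_obs, n, p0):
    """Exact two-sided binomial test: sum the probabilities of all
    outcomes no more likely than the observed one under H0."""
    p_obs = binom_pmf(k_obs, n, p0)
    # small tolerance guards against floating-point ties
    return sum(binom_pmf(k, n, p0) for k in range(n + 1)
               if binom_pmf(k, n, p0) <= p_obs * (1 + 1e-9))

# Clinician's example from (F): 2 of 10 sampled individuals have the
# disease; the null hypothesis says the population prevalence is 10%.
p_value = binom_test_two_sided(2, 10, 0.10)
print(round(p_value, 2))  # 0.26, matching the caption
```

Since 0.26 > 0.05, the sample gives no grounds to reject the null, exactly as the caption concludes.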
Figure 4
A few key useful roles of the p value From left to right: (A) It underpins a simple and clear decision-making system that has been accepted by broad scientific, clinical, and medical communities. Phase I is primarily aimed at safety and tolerability and, secondarily, at pharmacokinetics and pharmacodynamics. In phase II, the study is usually powered not for a clinical endpoint but for a biomarker. Phase III must indeed reach significance. For drug approval, significance is important, but so are safety issues and effect size. (B) It provides a common and straightforward rule that guides multiple experimenters in evaluating and comparing findings based on their respective p values and a pre-agreed significance level. (C) It evaluates the outcomes of a test on a continuous scale. (D) It allows integrating results from multiple studies and datasets (see “The pooling of p values via meta-analysis?”). (E) It facilitates causal inquiries and provides a metric to evaluate and determine the existence and strength of potential causation (see “The roles of the p value in causal inference, feature selection, and predictive modeling” for more details).
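One classical way to pool p values across studies, as in (D), is Fisher's method (named here as one standard option, not necessarily the specific procedure the review adopts): under the global null, X = −2 Σ ln(p_i) follows a chi-squared distribution with 2k degrees of freedom. For even degrees of freedom the chi-squared survival function has a closed form, so the sketch below needs only the standard library; the three p values are hypothetical:

```python
from math import log, exp

def fisher_combined(p_values):
    """Fisher's method: combine k independent p values into one.

    Under the global null, X = -2 * sum(ln p_i) ~ chi-squared(2k);
    for even df the survival function is exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    x = -2.0 * sum(log(p) for p in p_values)
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return exp(-half) * total

# Three studies with individually unconvincing p values can still yield
# strong pooled evidence against the global null (hypothetical values).
print(round(fisher_combined([0.08, 0.06, 0.10]), 4))
```

Note the design choice: with a single study (k = 1), the method returns the original p value unchanged, which is a useful sanity check.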
Figure 5
The roles of hypothesis testing and the p value in making causal inquiries, feature selection, and predictive modeling (A) Estimation of a causal effect. The average causal effect in a randomized study can be identified and quantified using the difference between the expected outcome of the treatment group and that of the control group and can subsequently be examined via a p value. (B) Out-of-sample test. The model performance or the causal effect estimated from one dataset, when not validated, may be exaggerated or may overfit the dataset. Out-of-sample testing can, to a certain degree, alleviate overfitting by training the model using a subset of the data (left) and testing it on the remaining, previously unseen, data (center). Additional testing using data from another study or a demographically distinct sample may further support the generalization of the trained model and its suggested causal claims (right). The p value is critical for evaluating whether the tests are successful, thereby guarding their validity and efficacy. (C) Graphical causal reasoning. The directed arrows (called edges) indicate potential causation. The figure gives a schematic example of the potential directed causal flows in the brain when performing moving object recognition. When one views a moving object, areas in the visual cortex, including V1, V3, and V4, first receive input from the pulvinar nucleus (PN) and lateral geniculate nucleus (LGN) (left). Subsequently, V1 sends signals to V3 (which processes dynamic form recognition) and V4 (which processes color recognition) and, through V3, sends information to the prefrontal cortex (center). Finally, there is reverse feedback from V3 and V4 to V1 (right). (D) Causal alteration. If altering the cause (while controlling for covariates) results in a change in the outcome, then it suggests that the stimulus causes the change in the outcome. The figure gives an example of deep brain stimulation (DBS), where, when applying DBS to a target brain region, the brain patterns of the area change accordingly, which then modifies (behavioral) symptoms. DBS is used in treating severe Parkinson's disease (PD). (E) The method of the instrumental variable (IV). When directly altering causes or randomization is unavailable, one can consider the method of IV. Suppose someone is interested in studying whether a head injury causes risky behavior. On the one hand, randomizing or assigning a head injury is impossible; on the other hand, it could be argued that reverse causation, where risky behavior causes a head injury, is also possible. By using an IV (i.e., wearing a helmet), one can then study whether a head injury causes risky behavior. Suppose one assumes that wearing a helmet is unlikely to cause risky behavior (in the long term) and that it is likely to reduce (the chance of getting) a head injury. If introducing helmets reduces risky behavior (while controlling for all other variables, such as age and gender), then it suggests that wearing helmets reduces head injury, which in turn reduces risky behavior. (F) The role of p values in feature selection and predictive modeling. From left to right: each box refers to a brain region; boxes with the same color but different hues indicate the same anatomical or functional brain area. Hypothesis testing between brain data and clinical (categorical, continuous, and longitudinal) outcomes yields a whole-brain p value map. Based on the p values, one can select features (biomarkers); the orange dots indicate selected (significant) features. These features, when coupled with estimated weights (not shown), can be used to predict categorical, continuous, or longitudinal outcomes in previously unseen subjects.
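The difference-in-means comparison in (A) can be tested in several ways; one assumption-light option is a permutation test, which exploits the fact that, under the null of no causal effect, group labels in a randomized study are exchangeable. A minimal sketch with hypothetical outcome data (standard library only):

```python
import random

def permutation_p_value(treated, control, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in group means.
    Under the null, labels are exchangeable, so we reshuffle them and
    count how often the shuffled difference is at least as extreme as
    the observed one."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / (len(pooled) - n_t)
        if abs(diff) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Hypothetical outcomes from a small randomized study
treated = [5.1, 6.0, 5.8, 6.3, 5.9, 6.1]
control = [4.8, 5.0, 5.2, 4.9, 5.3, 5.1]
print(permutation_p_value(treated, control))
```

The same shuffling scheme extends directly to the out-of-sample setting in (B), where the test is run on held-out data rather than the training set.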
Figure 6
The paradoxes of the p value (A) The associations between the p value, the sample size, and the significance level. The figure shows that the p value goes down as the sample size increases. The paradox is that, given a particular significance level (say, 0.05), one can increase the size of the sample to obtain a significant p value. (B) Even if the significance level is lowered (to, say, 0.005), one could keep increasing the sample size to obtain a significant p value. Conversely, with a fixed sample size, one may adjust the significance level to “control” whether the result is significant. (C) The paradox between the p value, the sample size, and statistical power. A larger sample size may yield a more significant p value even with a small effect size, but it also increases power. (D) Reducing the significance level (say, from 0.05 to 0.005) may produce more conservative testing results, but it reduces power. (A)–(D) demonstrate, from different perspectives, why the interpretation of the p value needs to be contextual.
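The sample-size paradox in (A) and (B) is easy to make concrete with a two-sided z test: for a fixed standardized effect size d, the test statistic is z = d·√n, so the p value shrinks deterministically as n grows. A short sketch (the effect size d = 0.05 is an illustrative assumption, not a value from the paper):

```python
from math import erf, sqrt

def two_sided_p(effect_size, n):
    """p value of a two-sided z test for a fixed standardized effect
    size d observed with sample size n: z = d * sqrt(n)."""
    z = effect_size * sqrt(n)
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

# A tiny effect (d = 0.05) crosses any significance threshold once the
# sample is made large enough -- the paradox described in (A) and (B).
for n in (100, 1000, 5000, 10000):
    print(n, two_sided_p(0.05, n))
```

The same run shows why lowering the threshold to 0.005, as in (B), only delays the crossing rather than preventing it.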
Figure 7
Making better use of the p value (A) A typical flowchart for conducting hypothesis-led testing of, for example, whether the correlation between two random variables is significantly different from zero. A significant correlation, however, does not equate to causation. Note that this framework forms the first part of the flowchart in (B). (B) A more rigorous flowchart. We use the correlation test as an example, which can be replaced with other models or tests. It can also be extended to cases involving more than two variables. For demonstration, we focus on testing linear causation and abbreviate the procedure for testing non-linear causation (which is marked with two parallel bars; interested readers can refer to Bai et al. and Hiemstra and Jones). The illustration demonstrates that even a simple analysis needs additional caution when causal inference and reproducibility are concerned. Such a flowchart, however, is not the only way to perform hypothesis testing; rather, we show that a more streamlined pipeline may help remove confounding effects, avoid overfitting, and facilitate reproducible research. A careful experimental design, appropriate data processing, and contextual scientific interpretation (not shown) are also important.
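The correlation test at the start of the flowchart in (A) is usually done with a t statistic; a permutation version (used here because it needs no distributional assumptions and only the standard library) gives the same kind of p value. The data below are hypothetical:

```python
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def correlation_p_value(x, y, n_perm=5000, seed=0):
    """Permutation p value for H0: zero correlation. Shuffling y breaks
    any dependence on x while preserving both marginal distributions."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson_r(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical paired measurements with a clear linear trend
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.3, 8.8]
print(round(pearson_r(x, y), 3), correlation_p_value(x, y))
```

As the caption stresses, a small p value here establishes association only; the later steps of the flowchart (confound removal, out-of-sample checks, causal tests) remain necessary.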
Figure 8
An illustration of Bayesian posterior evidence (A) The behavior of the posterior mean. (B) The behavior of the posterior standard deviation. (C) The behavior of the posterior evidence for H0: μ ≤ 109. (D) The behavior of the posterior evidence for H0: μ ≥ 111. See text for explanations.
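Posterior evidence of this kind is commonly computed with a conjugate normal-normal model: a normal prior on μ and normal data with known sampling standard deviation give a normal posterior, and the evidence for a one-sided hypothesis is a posterior tail probability. The sketch below keeps the caption's thresholds (109 and 111) but all other numbers are illustrative assumptions, not values from the paper:

```python
from math import erf, sqrt

def posterior_normal(prior_mean, prior_sd, data_mean, data_sd, n):
    """Conjugate normal-normal update with known sampling sd:
    precisions add, and the posterior mean is a precision-weighted
    average of the prior mean and the sample mean."""
    prior_prec = 1 / prior_sd**2
    data_prec = n / data_sd**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * data_mean)
    return post_mean, sqrt(post_var)

def normal_cdf(x, mean, sd):
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

# Hypothetical prior and data (only the thresholds 109 and 111 come
# from the caption).
post_mean, post_sd = posterior_normal(prior_mean=110, prior_sd=2,
                                      data_mean=112, data_sd=4, n=25)
evidence_le_109 = normal_cdf(109, post_mean, post_sd)       # P(mu <= 109 | data)
evidence_ge_111 = 1 - normal_cdf(111, post_mean, post_sd)   # P(mu >= 111 | data)
print(round(post_mean, 2), round(post_sd, 3))
print(evidence_le_109, evidence_ge_111)
```

Increasing n shrinks the posterior standard deviation and pulls the posterior mean toward the sample mean, which is exactly the behavior panels (A) and (B) trace.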
