[Preprint]. 2025 Mar 6:2025.03.05.25323315.
doi: 10.1101/2025.03.05.25323315.

GA4GH Phenopacket-Driven Characterization of Genotype-Phenotype Correlations in Mendelian Disorders

Lauren Rekerle et al. medRxiv.

Abstract

Comprehensively characterizing genotype-phenotype correlations (GPCs) in Mendelian disease would create new opportunities for improving clinical management and understanding disease biology. However, heterogeneous approaches to data sharing, reuse, and analysis have hindered progress in the field. We developed Genotype Phenotype Evaluation of Statistical Association (GPSEA), a software package that leverages the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema to represent case-level clinical and genetic data about individuals. GPSEA applies an independent filtering strategy to boost statistical power to detect categorical GPCs represented by Human Phenotype Ontology terms. GPSEA additionally enables visualization and analysis of continuous phenotypes, clinical severity scores, and survival data such as age of onset of disease or clinical manifestations. We applied GPSEA to 85 cohorts with 6613 previously published individuals with variants in one of 80 genes associated with 122 Mendelian diseases and identified 225 significant GPCs, with 48 cohorts having at least one statistically significant GPC. These results highlight the power of standardized representations of clinical data for scalable discovery of GPCs in Mendelian disease.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1: Schematic overview of GPSEA workflow.
a) Overview. GPSEA is a Python package designed to work well in Jupyter notebooks. GPSEA takes a collection of GA4GH phenopackets as input, performs quality assessment, and visualizes the salient characteristics of the cohort; genotype classes are then defined (Figure 2), and one of four classes of statistical test is performed for each hypothesis the user decides to test. b) Visualize data and formulate hypotheses. GPSEA displays tables with the distribution of phenotypic abnormalities, disease diagnoses, variants, and other information, and presents a cartoon with the distribution of variants across the protein. This information is intended to help users formulate hypotheses about genotype-phenotype correlations (GPCs). c) Statistical testing. GPSEA offers four main ways of testing phenotypes (see text for details and Figure 4 for examples).
Fig. 2: Variant predicates and genotype classifiers.
a, Variant predicate tests. GPSEA provides predicate functions that test whether a variant meets a criterion from one of three evidence groups: allele, functional annotation, or protein. For instance, a predicate can check whether the variant is a deletion, or whether it overlaps a specific exon or a protein region of interest. b, Boolean algebra. Variant predicates can be combined using the AND, OR, and NOT operators of Boolean algebra to test complex criteria. For instance, a predicate for a point mutation can be formulated as a “missense mutation affecting one reference base with a change length of zero” (no sequence loss or gain). A predicate for a loss-of-function mutation can be defined as a mutation leading to transcript ablation, a frameshift, introduction of a premature stop codon, or loss of the start codon. A predicate for a structural deletion can test whether the variant is either an imprecise chromosomal deletion or a deletion involving 50 or more base pairs (or another threshold). c, Genotype classifiers. Each classifier splits a cohort into two or more classes to enable genotype-phenotype comparisons. GPSEA ships with five built-in classifiers that classify cohort members by sex, by diagnosis, by a fixed count of alleles of different types (Monoallelic and Biallelic), or by differing allele counts of the same type (Allele count).
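The Boolean combination of predicates described in panel b can be sketched in a few lines of Python. This is a minimal illustration only: the `Variant` record, the `Predicate` class, the `has_effect` helper, and the effect labels are hypothetical names chosen for this example and are not GPSEA's actual API.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Variant:
    """Toy variant record (hypothetical; not GPSEA's data model)."""
    effects: FrozenSet[str]           # functional annotations, e.g. {"MISSENSE_VARIANT"}
    ref_length: int                   # number of reference bases affected
    change_length: int                # alt length minus ref length (negative = deletion)
    imprecise_deletion: bool = False  # True for an imprecise chromosomal deletion


class Predicate:
    """A named boolean test on a variant, composable with &, | and ~."""

    def __init__(self, test: Callable[[Variant], bool], name: str):
        self._test = test
        self.name = name

    def __call__(self, variant: Variant) -> bool:
        return self._test(variant)

    def __and__(self, other: "Predicate") -> "Predicate":
        return Predicate(lambda v: self(v) and other(v),
                         f"({self.name} AND {other.name})")

    def __or__(self, other: "Predicate") -> "Predicate":
        return Predicate(lambda v: self(v) or other(v),
                         f"({self.name} OR {other.name})")

    def __invert__(self) -> "Predicate":
        return Predicate(lambda v: not self(v), f"NOT {self.name}")


def has_effect(effect: str) -> Predicate:
    """Predicate that is true when the variant carries the given annotation."""
    return Predicate(lambda v: effect in v.effects, effect)


# Point mutation: missense AND one reference base AND change length of zero.
point_mutation = (
    has_effect("MISSENSE_VARIANT")
    & Predicate(lambda v: v.ref_length == 1, "ref_length == 1")
    & Predicate(lambda v: v.change_length == 0, "change_length == 0")
)

# Loss of function: any one of four annotation classes.
loss_of_function = (
    has_effect("TRANSCRIPT_ABLATION")
    | has_effect("FRAMESHIFT_VARIANT")
    | has_effect("STOP_GAINED")
    | has_effect("START_LOST")
)

# Structural deletion: imprecise deletion OR loss of at least 50 bp.
structural_deletion = (
    Predicate(lambda v: v.imprecise_deletion, "imprecise deletion")
    | Predicate(lambda v: v.change_length <= -50, "deletion >= 50 bp")
)
```

Calling `point_mutation(variant)` returns a plain `bool`, and `~loss_of_function` gives the set complement, so a genotype classifier could in principle be built by applying such predicates to each individual's variants.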
Fig. 3: Independent filtering for Human Phenotype Ontology.
Independent filtering for HPO (IF-HPO) removes hypotheses (here, HPO terms) by criteria independent of the test statistic to reduce the multiple testing burden and boost power. The HPO has a hierarchical structure going from general to specific terms. (1) IF-HPO does not test the top two levels of the HPO under the Phenotypic abnormality root, nor terms that are not descendants of Phenotypic abnormality, on the assumption that more specific terms are of higher medical and scientific interest and that the signal is likely to be driven by a more specific clinical manifestation. (2) Terms are not tested if they have exactly the same counts as one of their child terms, because in this case the annotations of the parent term are derived entirely from those of the child term by the true path rule. (3) Terms are not tested if the coverage is less than 40% of the entire cohort (assuming a cohort of 100 individuals in the figure), on the assumption that the result would not be representative of the cohort. (4) Terms are not tested if the total count is below a threshold for reaching the nominal statistical power. (5) Finally, terms are not tested if one of the genotype classes has neither present nor excluded observations.
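The five IF-HPO rules above can be illustrated with a small, self-contained Python sketch over a toy single-parent hierarchy. Several points are assumptions made for this example rather than details confirmed by the caption: "top two levels" is read as depth 1 and 2 below the root (the root itself is also skipped), "coverage" is read as the number of individuals with a present or excluded observation for the term, the `min_total` default of 7 is a placeholder, and the function names are hypothetical. The real HPO is a directed acyclic graph in which a term may have several parents.

```python
from typing import Dict, Optional, Tuple

ROOT = "Phenotypic abnormality"
# genotype class -> (n individuals with term present, n with term excluded)
Counts = Dict[str, Tuple[int, int]]


def depth_below_root(term: str, parent: Dict[str, str]) -> Optional[int]:
    """Number of steps from term up to ROOT, or None if ROOT is never reached."""
    depth = 0
    while term != ROOT:
        if term not in parent:
            return None
        term = parent[term]
        depth += 1
    return depth


def total_observations(counts: Counts) -> int:
    return sum(present + excluded for present, excluded in counts.values())


def skip_reason(
    term: str,
    parent: Dict[str, str],
    counts: Dict[str, Counts],
    cohort_size: int,
    min_coverage: float = 0.4,  # 40% threshold from the caption
    min_total: int = 7,         # placeholder value (assumption)
) -> Optional[str]:
    """Return why a term is filtered out, or None if it should be tested."""
    depth = depth_below_root(term, parent)
    if depth is None:
        return "not a descendant of Phenotypic abnormality"  # rule 1
    if depth <= 2:
        return "too general"                                 # rule 1
    term_counts = counts[term]
    for child, child_parent in parent.items():
        if child_parent == term and counts.get(child) == term_counts:
            return "same counts as a child term"             # rule 2
    total = total_observations(term_counts)
    if total < min_coverage * cohort_size:
        return "low coverage"                                # rule 3
    if total < min_total:
        return "too few observations"                        # rule 4
    if any(present + excluded == 0 for present, excluded in term_counts.values()):
        return "genotype class without observations"         # rule 5
    return None
```

Because none of these checks looks at the association between genotype and phenotype, the filter stays independent of the test statistic, which is the property that lets the multiple-testing correction be applied only to the surviving terms.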
Fig. 4: GPC Analysis.
This figure shows excerpted results from five example analyses. a) Visualization. GPSEA generates a cartoon showing the location and frequency of variants in protein sequences. The following panels show examples of statistically significant GPCs identified by GPSEA. b) Categorical analysis. Several phenotypic abnormalities (HPO terms) such as neurofibromas, freckling, Lisch nodules, optic nerve glioma, and scoliosis are significantly less frequent in individuals with neurofibromatosis type 1 due to variants located at the arginine residue at position 1830 of neurofibromin isoform 1 than in those with different mutations (Fisher exact test, IF-HPO, Benjamini-Hochberg correction). Pulmonary stenosis is, however, observed more often in those with an Arg1830 mutation. c) Severity score. A box plot with counts of abnormalities in 5 organ systems in individuals with mutations in RERE, showing the association of mutations in the Atrophin domain with abnormalities in multiple organ systems (Mann-Whitney U test, p=1.44 × 10⁻³). The boxes represent the Q1-Q3 range and the whiskers extend to the farthest score lying within 1.5× the interquartile range. The blue line denotes the median score. d) de Vries score. Box plots representing the association of the de Vries phenotype score and missense variants in CHD8 (Mann-Whitney U test, p=8.99 × 10⁻⁴). e) Continuous phenotypes. Association of CYP21A2 genotype (homozygous missense vs. other) with concentration of 17-OH progesterone (t-test, p=7.91 × 10⁻⁶). f) Survival analysis. Comparison of the onset of Stage 5 chronic kidney disease (HP:0003774) in individuals with UMOD mutations, showing a significantly earlier onset of the disease in individuals with the NP_000491.4:p.(Cys248Trp) mutation than in those with p.(Gln316Pro) (Logrank test, p=4.1 × 10⁻⁴). Overlined “Missense”: set complement of “missense”, i.e., any mutation that is not missense. LoF: loss-of-function.
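The categorical analysis in panel b combines a Fisher exact test per HPO term with Benjamini-Hochberg correction across terms. The sketch below shows both steps using only the Python standard library. It is an illustration of the two named methods, not GPSEA's implementation; the function names are hypothetical, and any counts passed in are the caller's own (none are taken from the paper).

```python
from math import comb
from typing import List


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are genotype classes; columns are phenotype present / excluded.
    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x "present" individuals in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, p_value)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, one p-value would be computed per HPO term that survives independent filtering, and `benjamini_hochberg` would then be applied to that list so that only the tested terms contribute to the correction.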

