Genome Med. 2022 Jul 29;14(1):81.
doi: 10.1186/s13073-022-01078-y.

X-CAP improves pathogenicity prediction of stopgain variants

Ruchir Rastogi et al. Genome Med. 2022.

Abstract

Stopgain substitutions are the third-largest class of monogenic human disease mutations and are often examined first in patient exomes. Existing computational stopgain pathogenicity predictors, however, exhibit poor performance at the high sensitivity required for clinical use. Here, we introduce a new classifier, termed X-CAP, which uses a novel training methodology and unique feature set to improve the AUROC by 18% and decrease the false-positive rate 4-fold on large variant databases. In patient exomes, X-CAP prioritizes causal stopgains better than existing methods do, further illustrating its clinical utility. X-CAP is available at https://github.com/bejerano-lab/X-CAP.

Keywords: Machine learning; Nonsense; Pathogenicity prediction; Stopgain.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Stopgains are a sizable variant class. (a) The number of variants of each mutation type as a proportion of all DM (disease-causing) variants in HGMD 2020.1. Single base-pair stopgains are the third-largest class, trailing only missense variants and frameshift indels. (b) The prevalence of stopgains from Phase 3 of the 1000 Genomes Project (N = 2504) as a function of their allele frequencies within the same dataset. The average individual in the dataset harbors 12.5 stopgains with an allele frequency of less than 1%.
Fig. 2
X-CAP features show predictive power. Comparison of feature values for benign and pathogenic stopgains in the training set of D_original. (a) The Residual Variation Intolerance Score (RVIS) decile of genes, weighted by the number of variants they contain. Genes without RVIS values were excluded. Pathogenic variants are more prevalent in low-RVIS genes, namely those generally intolerant to variation. (b) Kernel Density Estimation (KDE) plot of the relative variant location, defined as the distance in the coding domain sequence (CDS) from the translation start site divided by the total CDS length. On average, benign stopgains are located later in transcripts than pathogenic stopgains. (c) KDE plot of the number of exons in the mutated gene. The maximum number of exons is clipped to 100 for clarity. Genes containing benign stopgains tend to have fewer exons than genes containing pathogenic stopgains. (d) Odds ratios (pathogenic/benign) comparing variants that introduce a given stop codon to those that do not. The TGA stop codon, molecularly shown to be the most amenable to read-through of the three [36], is depleted in pathogenic variants. (e) Odds ratios comparing 5’ proximal stopgains (those within the first 100 bp of the sequence) that have a potential alternative downstream start codon a given distance away against those that do not. Pathogenic variants tend to be located further from the next downstream start codon than benign variants. (f) KDE plot of the mean phyloP of the downstream region, the portion of the CDS truncated by the stopgain. Regions downstream of pathogenic variants are more conserved than regions downstream of benign variants. In panels b, c, and f, Scott’s Rule [52] was used to calculate the bandwidth of the Gaussian kernel. In panels d and e, error bars denote 95% confidence intervals for the odds ratio.
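As an illustration of two quantities named in this caption, the following is a minimal Python sketch of the relative variant location (panel b) and of an odds ratio with a 95% confidence interval (panels d and e). The caption does not state how the intervals were computed; the Wald interval on the log scale used here is an assumption, and the function names and counts are hypothetical.

```python
import math


def relative_variant_location(cds_offset, cds_length):
    """Distance from the translation start site divided by the total CDS length."""
    if cds_length <= 0 or not 0 <= cds_offset <= cds_length:
        raise ValueError("expected 0 <= cds_offset <= cds_length and cds_length > 0")
    return cds_offset / cds_length


def odds_ratio_ci(path_with, path_without, benign_with, benign_without, z=1.96):
    """Odds ratio (pathogenic/benign) with a Wald interval on the log scale.

    The Wald method is an assumption for illustration; z=1.96 gives ~95% coverage.
    """
    counts = (path_with, path_without, benign_with, benign_without)
    if min(counts) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (path_with / path_without) / (benign_with / benign_without)
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log = math.sqrt(sum(1.0 / c for c in counts))
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: variants that do / do not introduce a given stop codon.
estimate, low, high = odds_ratio_ci(20, 80, 40, 60)
location = relative_variant_location(150, 600)
```

An interval that lies entirely below 1 would correspond to a feature value that is depleted in pathogenic variants, as the caption reports for the TGA stop codon.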
Fig. 3
X-CAP outperforms competitors. (a) For each model, we plot the ROC curve and associated AUROC metric on the test set of D_original. X-CAP has the highest AUROC, improving upon the previous state-of-the-art by 0.14 absolute points. The orange and green dotted lines display X-CAP’s performance when trained only on variants present in the databases used by MutPred-LoF and ALoFT, respectively. To ensure a fair comparison, we randomly subsampled these datasets to the size used in the original papers (n indicates the size of the training set). (b) We enlarge the portion of the plot above the dashed line in panel a to show performance in the clinically relevant, high-sensitivity region (TPR ≥ 0.95). We also display the hsr-AUROC, which is the normalized area under the curve in the high-sensitivity region. We optimized X-CAP to excel in this region, rather than over the full ROC. At 95% sensitivity, X-CAP correctly classifies 80.0% of benign stopgain variants, over four times more than any other classifier.
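The hsr-AUROC described in panel b can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the area is the part of the ROC curve lying above TPR = 0.95, integrated over the full FPR range and divided by the maximum attainable area (1 − 0.95), so that a perfect classifier scores 1. The function names and toy data are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points (fpr, tpr) from (0, 0) to (1, 1); label 1 = pathogenic, 0 = benign."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one pathogenic and one benign variant")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last variant sharing this score (tie handling).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def hsr_auroc(scores, labels, min_tpr=0.95):
    """Normalized area between the ROC curve and the line TPR = min_tpr (assumed definition)."""
    points = roc_points(scores, labels)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr or x1 == x0:
            continue  # segment lies below the band, or is vertical and adds no area
        if y0 < min_tpr:
            # Clip the segment at the point where it crosses TPR = min_tpr.
            x0 = x0 + (min_tpr - y0) * (x1 - x0) / (y1 - y0)
            y0 = min_tpr
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2.0
    return area / (1.0 - min_tpr)


# Toy data: a perfectly separating score, an inverted score, and an uninformative one.
perfect = hsr_auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
inverted = hsr_auroc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
uninformative = hsr_auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

Under this normalization an uninformative classifier, whose ROC curve is the diagonal, scores only 0.025, which shows how strongly the metric emphasizes the high-sensitivity region compared with the full AUROC.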
Fig. 4
X-CAP eliminates the most benign stopgain VUS in control exomes. We plot the fraction of rare benign stopgain variants that were assigned scores below the 95%-sensitivity threshold for each classifier. These variants were taken from exomes from a control population (N = 480) in an Inflammatory Bowel Disease (IBD) study. The performance of all classifiers on exomes closely matches their performance on aggregated variant sets in Fig. 3b and Additional file 1: Fig. S3b. X-CAP increases the percentage of benign VUS eliminated by 4.4-fold.
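The quantity plotted here can be sketched in Python as below, under stated assumptions: the 95%-sensitivity threshold is taken to be the largest score cutoff at which at least 95% of known pathogenic variants score at or above the cutoff, and a benign variant counts as eliminated when its score falls strictly below that cutoff. The function names and scores are hypothetical and do not come from the X-CAP repository.

```python
import math


def sensitivity_threshold(pathogenic_scores, sensitivity=0.95):
    """Largest cutoff such that at least `sensitivity` of pathogenic scores are >= cutoff."""
    if not pathogenic_scores:
        raise ValueError("need at least one pathogenic score")
    ranked = sorted(pathogenic_scores, reverse=True)
    # Small tolerance guards against floating-point error in sensitivity * n.
    needed = max(1, math.ceil(sensitivity * len(ranked) - 1e-9))
    return ranked[needed - 1]


def fraction_eliminated(benign_scores, threshold):
    """Fraction of benign variants scoring strictly below the threshold."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    return sum(score < threshold for score in benign_scores) / len(benign_scores)


# Hypothetical scores: 20 pathogenic variants and 4 rare benign variants.
pathogenic = [i / 20 for i in range(1, 21)]
benign = [0.01, 0.05, 0.09, 0.5]
cutoff = sensitivity_threshold(pathogenic)
eliminated = fraction_eliminated(benign, cutoff)
```

Fixing the cutoff on pathogenic variants first, and only then counting benign variants below it, mirrors the caption's framing: sensitivity is held at the clinically required level and classifiers are compared on how many benign VUS they remove.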

References

    1. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011;12(11):745–55. doi: 10.1038/nrg3031.
    2. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46(D1):1062–7. doi: 10.1093/nar/gkx1153.
    3. Stenson PD, Mort M, Ball EV, Chapman M, Evans K, Azevedo L, Hayden M, Heywood S, Millar DS, Phillips AD, et al. The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum Genet. 2020;139(10):1197–207. doi: 10.1007/s00439-020-02199-3.
    4. Won D-G, Kim D-W, Woo J, Lee K. 3Cnet: pathogenicity prediction of human variants using multitask learning with evolutionary constraints. Bioinformatics. 2021;37(24):4626–34. doi: 10.1093/bioinformatics/btab529.
    5. Wenger AM, Guturu H, Bernstein JA, Bejerano G. Systematic reanalysis of clinical exome data yields additional diagnoses: implications for providers. Genet Med. 2017;19(2):209–14. doi: 10.1038/gim.2016.88.
