Nat Commun. 2019 May 30;10(1):2373.
doi: 10.1038/s41467-019-10016-3.

Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP

Anja Thormann et al. Nat Commun.

Abstract

We aimed to develop an efficient, flexible and scalable approach to diagnostic genome-wide sequence analysis of genetically heterogeneous clinical presentations. Here we present G2P (www.ebi.ac.uk/gene2phenotype) as an online system to establish, curate and distribute datasets for diagnostic variant filtering via association of allelic requirement and mutational consequence at a defined locus with phenotypic terms, confidence level and evidence links. An extension to Ensembl Variant Effect Predictor (VEP), VEP-G2P, was used to filter both disease-associated and control whole exome sequence (WES) data with Developmental Disorders G2P (G2PDD; 2044 entries). VEP-G2PDD shows a sensitivity/precision of 97.3%/33% for de novo and 81.6%/22.7% for inherited pathogenic genotypes, respectively. Many of the missing genotypes are likely false-positive pathogenic assignments. The expected number and discriminative features of background genotypes are defined using control WES. Using only human genetic data, VEP-G2P performs well compared to other freely available diagnostic systems, and future phenotypic matching capabilities should further enhance performance.

Conflict of interest statement

M.E.H. is a co-founder, consultant and non-executive director of Congenica Ltd. The remaining authors declare no competing interests.

Figures

Fig. 1
Summary of LGMDET structure and application in diagnostic filtering. a summarizes the components of an LGMDET. Each locus-genotype-mechanism-disease-evidence thread (LGMDET) associates an allelic requirement and a mutational consequence at a defined locus with a disease entity, a confidence level and evidence links. The publicly available G2PDD and G2PCancer data can be searched or downloaded on the website (https://www.ebi.ac.uk/gene2phenotype). b gives examples of LGMDETs from curated datasets. In addition to the publicly available G2PDD and G2PCancer data, G2PEye is actively curated and will be publicly available soon. Access to the curation system can be requested for the creation of user-defined datasets. c summarizes the workflow for diagnostic filtering. The VCF files derived from the next-generation sequence data are passed to VEP, which uses Ensembl annotation data to compute and annotate the consequence of each variant. The VEP-G2P plugin runs as an additional step of the VEP analysis. It uses the results of VEP’s computations and annotations together with the knowledge from the LGMDETs to filter the variants from the patient’s input VCF file. The plugin results, plausible genotypes of likely deleterious variants, are returned together with the VEP output file for clinical review. d Lastly, the combined analysis of running VEP and VEP-G2P is repeated for a control cohort. Comparing results from a population unselected for disease with the results from a disease cohort yields the expected background, which is used to quantify diagnostic noise and to identify discriminating features between the two cohorts.
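The idea of filtering by allelic requirement can be illustrated with a small Python sketch. It is not the VEP-G2P plugin and does not reproduce its rules: the class names, field names, gene labels and the zygosity logic (monoallelic needs at least one variant; biallelic needs one homozygous or two heterozygous variants) are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LGMDET:
    """Hypothetical, simplified record for one locus-genotype-mechanism-disease-evidence thread."""
    locus: str                   # gene symbol
    allelic_requirement: str     # assumed values: "monoallelic" or "biallelic"
    mutational_consequence: str
    disease: str
    confidence: str
    evidence: tuple = ()         # e.g. publication identifiers


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant, already judged likely deleterious upstream."""
    gene: str
    zygosity: str                # assumed values: "het" or "hom"


def genes_passing(threads, variants):
    """Return loci whose variants satisfy the (assumed) allelic requirement."""
    by_gene = {}
    for v in variants:
        by_gene.setdefault(v.gene, []).append(v)

    passing = set()
    for t in threads:
        hits = by_gene.get(t.locus, [])
        n_hom = sum(v.zygosity == "hom" for v in hits)
        n_het = sum(v.zygosity == "het" for v in hits)
        if t.allelic_requirement == "monoallelic" and n_het + n_hom >= 1:
            passing.add(t.locus)
        elif t.allelic_requirement == "biallelic" and (n_hom >= 1 or n_het >= 2):
            passing.add(t.locus)
    return passing


# Made-up threads and variants; GENE_A and GENE_B are placeholders, not real entries.
threads = [
    LGMDET("GENE_A", "monoallelic", "loss of function", "disorder A", "confirmed"),
    LGMDET("GENE_B", "biallelic", "loss of function", "disorder B", "probable"),
]
one_het_each = [Variant("GENE_A", "het"), Variant("GENE_B", "het")]
print(sorted(genes_passing(threads, one_het_each)))
```

With one heterozygous variant in each gene, only the monoallelic locus passes under these assumed rules; a second heterozygous variant in the biallelic gene would make it pass as well.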
Fig. 2
Diagnostically discriminative VEP-G2P disease-specific output. VEP-G2P analysis of three independent WES cohorts: DDD (n = 7357), CRC (n = 517) and GS (n = 315). a Odds ratios for samples carrying at least one valid G2P variant (passing the G2P criteria and on a canonical transcript) in 454 unique G2PDD monoallelic genes: DDD vs GS (red) and CRC vs GS (black); two-tail Fisher’s Exact Test: *p-value ≤ 5 × 10^-2, **p-value ≤ 5 × 10^-3, ***p-value ≤ 5 × 10^-6, n.s., not significant; considering only missense variants where SIFT and PolyPhen agree on deleterious/damaging. b Odds ratios for samples carrying at least one valid G2P variant in 950 different G2PDD biallelic genes. No stop_lost and inframe_insertion variants were found in the GS cohort and few in DDD or CRC (p-value > 5 × 10^-2). Error bars = 95% confidence intervals (CI) in a and b. c Proportion of individuals in the three cohorts (y-axis) carrying a particular number of LOF and missense (regardless of their SIFT/PolyPhen status and CADD score) variants reported by VEP-G2PDD (x-axis). The proportion of DDD individuals for which no VEP-G2PDD hit is found is significantly lower compared to CRC and GS cohorts, both for monoallelic (p-values for two-tail Fisher’s Exact Test comparing the number of individuals for which no variant is found to those for which at least one variant is found: DDD vs GS = 7.9e-09, DDD vs CRC = 2.3e-12, CRC vs GS = 0.93) and biallelic genes (DDD vs GS = 1.5e-10, DDD vs CRC = 1.5e-11, CRC vs GS = 0.39). DDD (n = 7357 individuals), CRC (n = 517), GS (n = 315). d The DDD cohort is significantly enriched for unique missense variants with CADD > 30 in G2PDD genes (top) compared to GS (p-value two-tail Fisher’s Exact Test = 0.005), with no significant difference between DDD and CRC (p-value = 0.17) or between CRC and GS (p-value = 0.16). There is no significant difference for the proportion of unique missense variants with CADD > 30 in the CRC and GS cohorts in G2PCancer genes (bottom, p-value = 1.0).
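The cohort comparisons in this caption rest on 2 × 2 tables (carriers vs non-carriers in two cohorts). A minimal Python sketch of an odds ratio with a 95% CI and a two-tailed Fisher's exact test follows. The caption does not say how the confidence intervals were computed, so the Woolf (log) interval used here is an assumption, and the counts in the example are the classic textbook 3/1/1/3 table, not data from the paper.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for the table [[a, b], [c, d]] with a Woolf (log) 95% CI.

    Assumes all four cells are non-zero; the CI method is an assumption,
    not necessarily the one used for the figure.
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_, math.exp(math.log(or_) - z * se), math.exp(math.log(or_) + z * se)


def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties with p_obs are counted.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Textbook example table (not cohort data): rows = groups, columns = carrier / non-carrier.
or_, ci_low, ci_high = odds_ratio_ci(3, 1, 1, 3)
p_value = fisher_two_tailed(3, 1, 1, 3)
print(or_, ci_low, ci_high, p_value)
```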
Fig. 3
Sensitivity and precision of VEP-G2P analysis. GQ, genotype quality; MAF, minor allele frequency; alt:ref, the ratio of alternate to reference alleles; TP, true positive; FP, false positive; TN, true negative; FN, false negative. Sensitivity = TP/(TP + FN). Precision = TP/(TP + FP). a Evaluation of G2P accuracy for likely causative variants in 94 genes achieving genome-wide significance (GWS) for de novo mutations in the DDD study. b G2P accuracy against the set of variants previously identified by DDD in the first 1133 samples, excluding de novo mutations. c ROC curves for VEP-G2PDD performance on 1700 DDD probands with de novo mutations (DNM) identified in the 484 monoallelic genes. The points on the curves represent varying MAF cut-offs: not seen in any control databases (bottom left), MAF < 1:100000, MAF < 1:50000, MAF < 1:25000, MAF < 1:10000 (top right). The region in the top left corner of the ROC space graph has been expanded to scale using the regions bounded by the dashed line rectangles. d The effect of consequence type and MAF on precision and recall (PR curves) of VEP-G2PDD using the same data analysed for the ROC space in c. The highest precision [0.812, 0.863] is achieved for LOF variants but with the lowest recall [0.425, 0.437]. The highest recall is achieved for variants of all consequence types [0.897, 0.942] at the cost of decreased precision [0.334, 0.476]. Analysing only missense variants with CADD ≥ 30 or CADD ≥ 20 leads to improvements in precision at the cost of decreasing recall.
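The two definitions given in this caption translate directly into code. The sketch below is only a worked illustration of those formulas; the function names are ours and the counts are invented round numbers, not results from the study.

```python
def sensitivity(tp, fn):
    """Sensitivity (recall) = TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("sensitivity is undefined when TP + FN = 0")
    return tp / (tp + fn)


def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("precision is undefined when TP + FP = 0")
    return tp / (tp + fp)


# Invented counts for illustration: 90 causal genotypes reported, 10 missed,
# and 210 reported genotypes that are not causal.
tp, fn, fp = 90, 10, 210
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"precision   = {precision(tp, fp):.3f}")
```

A filter tuned for high sensitivity, as in this invented example, reports most causal genotypes but leaves many non-causal ones for clinical review, which is the trade-off the ROC and PR panels describe.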
Fig. 4
Comparison of VEP-G2P to existing tools. a Comparison for 100 random DDD samples with dominant de novo sample (left panel) and 100 random DDD recessive samples (right panel). b Comparison for 100 unaffected Generation Scotland (GS) samples. Each dot represents the rank of the causative gene in the output of the tools, where a rank of 1 indicates the causative gene is at the top of the list reported by the tool; boxplots show the median (centre line) and the first and third quartiles (box bounds); whiskers represent 1.5 × the interquartile range from the first/third quartiles. F1 = 2 × (precision × recall)/(precision + recall).
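The F1 formula in this caption is the harmonic mean of precision and recall. As a small worked example, the sketch below applies it to the de novo sensitivity/precision pair quoted in the abstract (97.3%/33%); the resulting value is our own arithmetic for illustration and is not a figure reported by the authors.

```python
def f1_score(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here for the degenerate case
    return 2 * (precision * recall) / (precision + recall)


# Values taken from the abstract (de novo genotypes); recall equals sensitivity.
f1_de_novo = f1_score(precision=0.33, recall=0.973)
print(f"F1 = {f1_de_novo:.3f}")
```

Because F1 is a harmonic mean, it stays close to the lower of the two inputs, so a high recall cannot compensate for low precision.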

Publication types