PLoS Genet. 2022 Apr 29;18(4):e1010191.
doi: 10.1371/journal.pgen.1010191. eCollection 2022 Apr.

Classification of non-coding variants with high pathogenic impact

Lambert Moyon et al. PLoS Genet. 2022.

Abstract

Whole genome sequencing (WGS) is increasingly used to diagnose medical conditions of genetic origin. While both coding and non-coding DNA variants contribute to a wide range of diseases, most patients who receive a WGS-based diagnosis today harbour a protein-coding mutation. Functional interpretation and prioritization of non-coding variants remain a persistent challenge, and disease-causing non-coding variants remain largely unidentified. Depending on the disease, WGS fails to identify a candidate variant in 20-80% of patients, severely limiting the usefulness of sequencing for personalised medicine. Here we present FINSURF, a machine-learning approach to predict the functional impact of non-coding variants in regulatory regions. FINSURF outperforms state-of-the-art methods, owing in particular to optimized selection of control variants during training. In addition to ranking candidate variants, FINSURF breaks down the score for each variant into contributions from individual annotations, facilitating the evaluation of their functional relevance. We applied FINSURF to a diverse set of 30 diseases with described causative non-coding mutations, and correctly identified the disease-causative non-coding variant within the top ten hits in 22 cases. FINSURF is implemented as an online server as well as custom browser tracks, and provides a quick and efficient solution to prioritize candidate non-coding variants in realistic clinical settings.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. FINSURF design strategy.
a. Percentage of genetic variants intersecting GENCODE biotypes across benign variants (shades of blue, corresponding to different sampling strategies) and damaging variants from the HGMD database (red). b. The final pipeline leading to the FINSURF model. Control negative variants were sampled using the Adjusted strategy. Both the negative and positive sets were annotated with 41 features, and a random forest classifier was trained to distinguish them on this basis. Ten iterations were performed, each time training on 9/10 of the data and testing performance on the remaining 1/10, which had not been used for training.
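The ten-iteration scheme in panel b is a standard 10-fold cross-validation. The following is a minimal sketch of how such splits can be generated, not the authors' code: the function name `k_fold_indices`, the seed, and the toy sample size of 100 are assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Train on every fold except the held-out one (9/10 of the data).
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Toy usage: 100 hypothetical variants, 10 folds.
splits = list(k_fold_indices(100, k=10))
for train, test in splits:
    # Held-out variants never overlap the variants used for fitting.
    assert not set(train) & set(test)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # prints "10 90 10"
```

Each variant appears in exactly one test fold, so every score reported on test data comes from a model that never saw that variant during training.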
Fig 2. FINSURF performances.
a. Receiver Operating Characteristic (ROC) curve after a 10-fold training procedure. The average curve is shown in bold red and the 95% confidence interval is indicated by a pink shading, with the mean Area Under Curve (AUC) reported in the bottom right. The dashed diagonal line indicates the distinction between positives and negatives expected by chance (AUC = 0.50). b. Precision-Recall Curve (PRC) computed from the same 10-fold training procedure. As for the ROC, the average curve is shown in bold red and the 95% confidence interval is indicated by a pink shading, with the mean Area Under Curve (AUC) reported at the bottom. The dashed diagonal line indicates the proportion of true positives that would be recovered by a model predicting all variants as positive, fixed at 12.5%. c. Distributions of FINSURF scores in the test set for each of the 10-fold trainings. Scores for negative variants are shown in blue, and for positive variants in red. The vertical dashed line represents the optimal score threshold (0.51) to separate positives from negatives (Material and Methods). d. Comparison of ROC curves between FINSURF and eight other methods on a set of 62 variants independent from the training set of FINSURF. AUC values for each method are indicated in the legend.
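The two baselines in panels a and b can be illustrated with a small sketch. It assumes the pairwise-ranking definition of ROC AUC and uses invented toy scores. The helper names `roc_auc` and `all_positive_precision` are hypothetical, and the 1-to-7 ratio of positives to negatives is chosen only because it reproduces the 12.5% figure arithmetically.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def all_positive_precision(labels):
    """Precision of a model that calls every variant positive (the class prevalence)."""
    return sum(labels) / len(labels)

# Toy data (assumption): one positive for every seven negatives.
labels = [1] + [0] * 7
scores = [0.9, 0.6, 0.4, 0.3, 0.3, 0.2, 0.1, 0.05]

print(roc_auc(scores, labels))         # prints 1.0 (the positive outranks every negative)
print(all_positive_precision(labels))  # prints 0.125
```

A classifier that scores at random would give an AUC near 0.50 on the ROC, while on the precision-recall plot the reference level is the prevalence of positives, which is why the two panels use different baselines.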
Fig 3. Feature contributions.
a. The 880 positive variants were clustered using K-means into 7 clusters based on the contributions of all 41 features to their FINSURF score. Variants were classified as true positives or false negatives using the optimal score threshold (0.51). b. Average feature contributions in each cluster. The grey-red gradient reflects the normalized contribution of each feature and is relative across the entire grid. Features are grouped by functionally relevant categories (denoted by green, purple, red and blue colours). c. Functional profile of a True Positive variant, characterized as a disease-causing mutation impacting the SERPINC1 promoter. The heights of the bars represent the values of each feature, rescaled between -1 and 1 from their distribution over the 400 Mb of regulatory regions. The colours represent the feature contributions, highlighting which features contributed positively (red) or negatively (blue) to the prediction score. d. Functional profile of a False Positive variant, passing the optimal threshold of 0.51, and found in regulatory regions also associated with SERPINC1.
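The clustering in panel a groups variants whose scores are driven by similar features. Below is a minimal pure-Python sketch of K-means (Lloyd's algorithm) on toy contribution vectors. It is not the authors' implementation: the function `kmeans`, the seed, the iteration count, and the two-feature toy vectors are assumptions, whereas the caption describes 41 features and 7 clusters.

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Cluster points with Lloyd's algorithm; return (assignments, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

# Toy contribution vectors (assumption): two variants driven by the first
# feature pair near zero, two driven by strongly positive contributions.
contributions = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
assign, centroids = kmeans(contributions, k=2)
print(assign[0] == assign[1], assign[2] == assign[3], assign[0] != assign[2])
```

Averaging the contribution vectors within each resulting cluster gives the kind of per-cluster profile summarised in panel b.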
Fig 4. Application to medical genetics.
a. A set of 49 regulatory variants causing human diseases (x-axis) not used for training were scored by FINSURF (y-axis). Eleven variants target a disease gene that is also targeted by a training variant (in blue), while 38 variants are totally independent (in purple). b. The 49 variants were seeded amongst over 4 million variants from a representative, otherwise healthy individual human genome, and their respective ranks are shown in the top bar (log scale; colours represent different diseases). When pathogenic and background variants are restricted to putatively functional non-coding sequences based on molecular or evolutionary evidence, ranking remains uninformative (second bar). However, when filtering for variants associated with disease genes, disease-causing mutations generally show high-ranking positions (coloured bars; total number of non-coding variants associated with each disease indicated on the left; pathogenic variants highlighted in dark, with their rank above). c. Detailed genomic context for a non-coding mutation causing van der Woude syndrome 1 (MIM 119300), which is located in an enhancer ~30 kb 5’ of the TSS of its target gene, interferon regulatory factor 6 (IRF6). Gene associations are from the GeneHancer collection, and depict the enhancer (green horizontal bar) with the link to its predicted target gene (dashed arc). All tracks are from the UCSC genome browser.
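The effect of gene-level filtering on rank, described in panel b, can be sketched as follows. All scores, the record layout, the placeholder gene names `OTHER1` and `OTHER2`, and the helpers `rank_of` and `rank_within_gene` are hypothetical; only the gene name IRF6 is taken from the caption.

```python
def rank_of(candidate_score, background_scores):
    """1-based rank of a candidate when variants are sorted by descending score."""
    return 1 + sum(1 for s in background_scores if s > candidate_score)

def rank_within_gene(candidate, background, gene):
    """Rank the candidate only against background variants linked to `gene`."""
    linked = [v["score"] for v in background if gene in v["genes"]]
    return rank_of(candidate["score"], linked)

# Invented background variants from a hypothetical individual genome.
background = [
    {"score": 0.95, "genes": {"OTHER1"}},
    {"score": 0.90, "genes": {"OTHER2"}},
    {"score": 0.40, "genes": {"IRF6"}},
    {"score": 0.20, "genes": {"IRF6"}},
]
candidate = {"score": 0.80, "genes": {"IRF6"}}  # invented pathogenic variant

genome_wide = rank_of(candidate["score"], [v["score"] for v in background])
gene_level = rank_within_gene(candidate, background, "IRF6")
print(genome_wide, gene_level)  # prints "3 1"
```

In this toy case the candidate is outranked genome-wide by high-scoring variants near unrelated genes, but ranks first once the comparison is restricted to variants associated with the disease gene.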
