Sci Rep. 2020 Mar 12;10(1):4569.
doi: 10.1038/s41598-020-61288-5.

Forecasting risk gene discovery in autism with machine learning and genome-scale data

Leo Brueggeman et al. Sci Rep. 2020.

Erratum in

Abstract

Genetics has been one of the most powerful windows into the biology of autism spectrum disorder (ASD). It is estimated that a thousand or more genes may confer risk for ASD when functionally perturbed; however, only around 100 genes currently have sufficient evidence to be considered true "autism risk genes". Massive genetic studies are currently underway, producing data to implicate additional genes. This approach, although necessary, is costly and slow-moving, making identification of putative ASD risk genes with existing data vital. Here, we approach autism risk gene discovery as a machine learning problem, rather than a genetic association problem, by using genome-scale data as predictors to identify new genes with similar properties to established autism risk genes. This ensemble method, forecASD, integrates brain gene expression, heterogeneous network data, and previous gene-level predictors of autism association into an ensemble classifier that yields a single score indexing evidence of each gene's involvement in the etiology of autism. We demonstrate that forecASD has substantially better performance than previous predictors of autism association in three independent trio-based sequencing studies. Studying forecASD-prioritized genes, we show that forecASD is a robust indicator of a gene's involvement in ASD etiology, with diverse applications to gene discovery, differential expression analysis, eQTL prioritization, and pathway enrichment analysis.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Overview and performance of forecASD. forecASD (a) is a random forest ensemble of features derived from the BrainSpan developmental transcriptome, the STRING network, and several previously published ASD gene prediction methods. Using SFARI 1 and 2 genes as the positive class and 1,000 background genes as the negative class, class predictions of 17,957 genes yield a ranked list with values between 0 and 1. Ranked genes are split by decile and tested for enrichment (b) of multiple classes of de novo mutations derived from three independent ASD cohorts (MSSNG, SPARK pilot, ASC). The top decile synonymous mutation rate (0.196) in probands (bi.) is used as the expected proportion of mutations within the top decile of genes ranked by forecASD. A binomial test (fraction and p-value written below each mutation type) showed significant enrichment of genes affected by recurrent LOF, recurrent missense, and singleton LOF mutations in the top decile (bii–iv.).
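The binomial enrichment test described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a one-sided (upper-tail) test, which the caption does not state, and the counts `n_genes` and `n_top_decile` are hypothetical values chosen for illustration. Only the baseline proportion 0.196 comes from the caption.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Baseline from the caption: fraction of proband synonymous mutations
# that fall in the top decile of forecASD-ranked genes.
BASELINE = 0.196

# Hypothetical counts, for illustration only (not taken from the paper).
n_genes = 50       # genes carrying the mutation class of interest
n_top_decile = 30  # how many of those genes rank in the top decile

# One-sided test (an assumption): is the top-decile share above baseline?
p_value = binom_upper_tail(n_top_decile, n_genes, BASELINE)
print(f"{n_top_decile}/{n_genes} in top decile, one-sided p = {p_value:.3g}")
```

Using the synonymous-mutation rate as the null proportion, rather than the naive 0.1 expected for a decile, corrects for any tendency of top-ranked genes to accumulate more mutations of every kind (for example, because they are longer).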
Figure 2
Performance comparison of forecASD with competing methods in prioritizing genes hit by recurrent de novo loss-of-function mutations across three independent ASD cohorts. Bar height and color indicate the fraction of genes with recurrent LOF mutations (n = 44) within the given decile of genes ranked by each score. A binomial test was used to test for enrichment of genes with recurrent LOF mutations in the top decile of each score, using each score's decile enrichment for proband synonymous mutations as a baseline (result listed in parentheses below score name).
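The per-decile fractions plotted in this figure can be computed along these lines. This is a hedged sketch: the function name `decile_fractions`, the toy gene names, and the toy scores are all hypothetical, and the paper's handling of tied scores and uneven decile sizes is not described in the caption.

```python
def decile_fractions(scores: dict[str, float], hits: set[str]) -> list[float]:
    """Fraction of `hits` that fall in each decile of genes ranked by score.

    Index 0 is the top decile (highest scores). Hits without a score are ignored.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    counts = [0] * 10
    for rank, gene in enumerate(ranked):
        if gene in hits:
            counts[rank * 10 // n] += 1  # maps rank 0..n-1 onto decile 0..9
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Toy data (hypothetical): 20 genes with strictly decreasing scores.
toy_scores = {f"G{i}": 1 - i / 20 for i in range(20)}
toy_hits = {"G0", "G1", "G2", "G10"}

fractions = decile_fractions(toy_scores, toy_hits)
print(fractions[0])  # share of hit genes that land in the top decile
```

Running the same function once per scoring method, with the same hit set, gives directly comparable bars for each method.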
Figure 3
forecASD prioritizes ASD genetic signal across diverse data types. The majority of novel ASD genes recently identified in two ASD sequencing studies by TADA (not part of forecASD’s training set) fall within the top decile of forecASD (a). The forecASD score is significantly negatively correlated with ASD postmortem differential expression levels (b). Higher scoring forecASD genes are enriched for brain eQTL that also have low p-values (nominal < 0.01) in ASD GWAS (c).
Figure 4
Pathway analysis of genes highlighted by forecASD. When testing the top-decile genes according to forecASD for Reactome pathway enrichment, pathways emerged that were represented, but not enriched, in the SFARI HC list (a). In panel a, for each pathway, the top bar represents the number of genes (number) and the enrichment (color) for that pathway in top decile forecASD genes, while the bottom bar represents the enrichment for that pathway in SFARI HC genes. Other pathways that were highly enriched in forecASD genes were not represented at all in the SFARI HC list, even though associated literature suggests a role in autism (b). forecASD is more sensitive than SFARI HC to pathways that are differentially regulated in the brains of individuals with autism, particularly in ASD-upregulated pathways (c), but also in downregulated pathways (d). Using the top decile of TADA (− log10 FDR) genes showed similar sensitivity to SFARI HC (not shown), suggesting that rare variant approaches may be less sensitive in implicating genes found through gene expression studies.
Figure 5
Clustering of genes highlighted by forecASD into distinct modules. Greedy hierarchical optimization of the modularity score yielded 17 clusters consisting of 1,452 forecASD genes (a). All clusters have several significantly enriched biological pathways, of which the top terms were labeled. Clusters were tested for significance of overlap with the list of SFARI HC genes (b), and enrichment of haploinsufficiency genes (pLI > 0.5; c).
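Greedy hierarchical optimization of modularity can be illustrated with a naive agglomerative sketch: start with every node in its own community and repeatedly apply the merge that raises the modularity score Q the most, stopping when no merge improves it. This is an assumption-laden toy, not the authors' implementation: the function names and the example graph are hypothetical, the graph is treated as unweighted, and Q is recomputed from scratch at every step, which is only practical for small graphs.

```python
from itertools import combinations


def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    label = {node: i for i, comm in enumerate(communities) for node in comm}
    intra = [0] * len(communities)   # edges with both ends in community i
    degree = [0] * len(communities)  # summed node degree of community i
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            intra[label[u]] += 1
    return sum(e / m - (d / (2 * m)) ** 2 for e, d in zip(intra, degree))


def greedy_modularity(edges):
    """Start from singletons; repeatedly apply the merge that raises Q most."""
    nodes = {n for edge in edges for n in edge}
    communities = [frozenset([n]) for n in sorted(nodes)]
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        candidates = []
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda t: t[0])
        if q <= best_q:  # no merge improves Q: stop
            break
        best_q, communities = q, merged
    return communities, best_q


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
clusters, q = greedy_modularity(toy_edges)
print(sorted(sorted(c) for c in clusters), round(q, 4))
```

On this toy graph the procedure recovers the two triangles as separate clusters. For a network of the size described in the caption, an optimized library routine would be needed instead of this quadratic-per-step loop.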
