Am J Med Genet C Semin Med Genet. 2012 May 15;160C(2):130-42.
doi: 10.1002/ajmg.c.31330. Epub 2012 Apr 12.

Network- and attribute-based classifiers can prioritize genes and pathways for autism spectrum disorders and intellectual disability

Yan Kou et al. Am J Med Genet C Semin Med Genet. 2012.

Abstract

Autism spectrum disorders (ASD) are a group of related neurodevelopmental disorders with significant combined prevalence (∼1%) and high heritability. Dozens of individually rare genes and loci associated with high risk for ASD have been identified, and these overlap extensively with genes for intellectual disability (ID). However, studies indicate that hundreds of genes may remain to be identified. The advent of inexpensive massively parallel nucleotide sequencing can reveal the genetic underpinnings of heritable complex diseases, including ASD and ID. However, whole exome sequencing (WES) and whole genome sequencing (WGS) provide an embarrassment of riches, in which many candidate variants emerge. It has been argued that genetic variation for ASD and ID will cluster in genes involved in distinct pathways and protein complexes. For this reason, computational methods that prioritize candidate genes based on additional functional information, such as protein-protein interactions, association with specific canonical or empirical pathways, or other attributes, can be useful. In this study we applied several supervised learning approaches to prioritize ASD or ID disease gene candidates based on curated lists of known ASD and ID disease genes. We implemented two network-based classifiers and one attribute-based classifier to show that we can rank and classify known genes, and predict new genes, for these neurodevelopmental disorders. We also show that ID and ASD share common pathways that perturb an overlapping synaptic regulatory subnetwork, and that features relating to neuronal phenotypes in mouse knockouts can help in classifying neurodevelopmental genes. Our methods can be applied broadly to other diseases, helping to prioritize newly identified genetic variation that emerges from disease gene discovery based on WES and WGS.

Figures

Figure 1. Identification of (A) ID genes in the ASD gene neighborhood, (B) ASD genes in the ID gene neighborhood, or (C) LOOCV of the ASD+ID gene neighborhood
Shortest path distance Di (left) and MFPT score Sj (right) were computed for each node in the PPI network. The number of ID or ASD genes identified in each neighborhood within the specified cutoff range is shown on the left and the leave-one-out cross validation (LOOCV) of the seed gene lists is shown on the right.
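The neighborhood scoring and leave-one-out check described in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes an unweighted PPI graph stored as an adjacency dictionary and takes the score of a node to be its mean hop distance to the reachable seed genes, which may differ from the paper's exact definition of Di. The function names and the `cutoff` parameter are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distance from `source` to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def mean_distance_to_seeds(graph, seeds):
    """Map each node to its mean hop distance over the seeds that can reach it.

    Assumption: this stands in for the D_i score; the paper may define it differently.
    """
    totals, counts = {}, {}
    for seed in seeds:
        for node, d in bfs_distances(graph, seed).items():
            totals[node] = totals.get(node, 0) + d
            counts[node] = counts.get(node, 0) + 1
    return {node: totals[node] / counts[node] for node in totals}


def loocv_recovered(graph, seeds, cutoff):
    """Leave each seed out in turn and test whether the remaining seeds
    place it inside the neighborhood (score <= cutoff)."""
    recovered = set()
    for held_out in seeds:
        rest = set(seeds) - {held_out}
        scores = mean_distance_to_seeds(graph, rest)
        if held_out in scores and scores[held_out] <= cutoff:
            recovered.add(held_out)
    return recovered
```

On a toy path graph A-B-C-D-E with seeds A, B and E and a cutoff of 2, only B is recovered, because it is the only seed whose mean distance to the other two seeds does not exceed the cutoff.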
Figure 2. Non-cumulative percentages of identified non-seed gene hits per shortest path distance and MFPT score ranking in (A) ASD, (B) ID, or (C) ASD+ID neighborhoods
The green frames mark the Di and Sj cutoffs that were chosen arbitrarily to define the disease neighborhoods.
Figure 3. ROC curves and AUC analysis for the identification of (A) ASD genes, or (B) ID genes in ID or ASD gene neighborhoods
The Di of each gene to the seed list was calculated, and the ROC curve was plotted by increasing the cutoff distance in steps of 0.05, starting from the minimum distance of all genes in the network. The true positive rate (TPR) was defined as the proportion of genes in the inquiring list with Di shorter than the cutoff distance, out of the total number of genes in the list; the false positive rate (FPR) was defined as the proportion of genes not in the inquiring list with Di shorter than the cutoff distance, out of the total number of genes not in the list. Fifty lists were generated for each control type for comparison, shown in different colors. The mean FPR and TPR of the 50 control lists were used to plot the ROC curve. In the AUC section, a t-test was performed with the null hypothesis that the AUC for ASD/ID gene identification can be achieved with random gene lists of each type. P values <0.0001 are indicated by double stars (**), and P values <0.01 by a single star (*). The ROC curve for Sj was plotted in the same way, by increasing the cutoff rank one gene at a time.
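The cutoff sweep described in this caption can be illustrated with a short sketch. It assumes the Di scores are already available as a dictionary from gene to distance; the function names and the trapezoidal AUC are illustrative choices, not taken from the paper. The sweep continues one step past the largest distance so that the curve ends at (1, 1) under the strict "shorter than" comparison.

```python
def roc_by_distance_cutoff(distances, query_genes, step=0.05):
    """Return (FPR, TPR) points obtained by raising the distance cutoff.

    distances   : dict mapping gene -> distance score to the seed list
    query_genes : set of genes in the inquiring list (the positives)
    step        : cutoff increment (0.05 in the caption)
    """
    positives = [d for g, d in distances.items() if g in query_genes]
    negatives = [d for g, d in distances.items() if g not in query_genes]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative gene")

    lowest = min(distances.values())
    highest = max(distances.values())
    points = []
    k = 0
    while True:
        cutoff = lowest + k * step  # multiply rather than accumulate to limit float drift
        tpr = sum(d < cutoff for d in positives) / len(positives)
        fpr = sum(d < cutoff for d in negatives) / len(negatives)
        points.append((fpr, tpr))
        if cutoff > highest:  # every gene is now inside the cutoff
            break
        k += 1
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

For the control curves, the same function would be run on each of the 50 control lists and the FPR and TPR values averaged at each cutoff before plotting.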
Figure 4. ROC curves and AUC analysis of the SVM classifiers of (A) ASD or (B) ID genes
The classifiers were trained and tested by 10-fold cross-validation using the seed genes and different types of control gene lists of the same size. The average ROC curve over the 10 folds is plotted for each classifier. Inset plots show the average AUC with standard deviation for each classifier.
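The fold construction behind this kind of evaluation might look like the sketch below. It covers only the splitting step, under the assumption that seed genes are labeled 1 and an equally sized control list is labeled 0; the SVM itself and its feature vectors are left out, and `balanced_kfold` is a hypothetical helper rather than anything named in the paper.

```python
import random


def balanced_kfold(positives, negatives, k=10, seed=0):
    """Split equally sized positive and control gene lists into k folds.

    Returns a list of (train, test) pairs, each a list of (gene, label)
    tuples with label 1 for seed genes and 0 for control genes.
    """
    if len(positives) != len(negatives):
        raise ValueError("seed and control lists must have the same size")
    if k < 2 or k > len(positives):
        raise ValueError("k must be between 2 and the list size")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)

    folds = []
    for i in range(k):
        # Every k-th gene (offset i) goes to the test part of fold i.
        test = [(g, 1) for g in pos[i::k]] + [(g, 0) for g in neg[i::k]]
        train = [(g, 1) for j, g in enumerate(pos) if j % k != i]
        train += [(g, 0) for j, g in enumerate(neg) if j % k != i]
        folds.append((train, test))
    return folds
```

A classifier would then be fitted on each `train` part and scored on the matching `test` part, and the ten resulting ROC curves averaged.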
Figure 5. Genes identified using the three classifiers and their connections in functional association networks
Shortest path distance cutoffs of 3.95 and 3.65 (shown in Fig. 2) were applied for the identification of (A) ASD genes and (B) ID genes, respectively. The SVM-retrieved genes are the intersection of the genes retrieved by all six classifiers, each trained with a different type of control gene list. The 39 ASD genes and 59 ID genes identified by all three classifiers, as well as the combined 39+59 genes, were connected using functional association networks with the software Genes2FANs (http://actin.pharm.mssm.edu/genes2FANs); direct interactions are shown in (C) for the 39 ASD genes, (D) for the 59 ID genes, and (E) for the 39+59 combined genes.
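The consensus step in this caption amounts to a set intersection, which can be sketched as follows. The function name and the placeholder gene labels in the usage note are hypothetical and carry no biological meaning.

```python
def consensus_genes(*gene_lists):
    """Return the genes present in every supplied list.

    Each argument is an iterable of gene symbols, for example the genes
    retrieved by one classifier. With no arguments the result is empty.
    """
    if not gene_lists:
        return set()
    return set.intersection(*(set(genes) for genes in gene_lists))
```

For example, `consensus_genes({"G1", "G2", "G3"}, {"G2", "G3"}, ["G3", "G2", "G9"])` keeps only `G2` and `G3`, the two labels shared by all three inputs.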
