Genome Med. 2022 Apr 26;14(1):40.
doi: 10.1186/s13073-022-01042-w.

Large-scale discovery of novel neurodevelopmental disorder-related genes through a unified analysis of single-nucleotide and copy number variants

Kohei Hamanaka et al. Genome Med. 2022.

Abstract

Background: Previous large-scale studies of de novo variants identified a number of genes associated with neurodevelopmental disorders (NDDs); however, it was also predicted that many NDD-associated genes await discovery. Such genes can be discovered by integrating copy number variants (CNVs), which have not been fully considered in previous studies, and increasing the sample size.

Methods: We first constructed a model that estimates the rate of de novo CNVs per gene from several factors, such as gene length and number of exons. Second, we compiled a comprehensive list of de novo single-nucleotide variants (SNVs) in 41,165 individuals and de novo CNVs in 3,675 individuals with NDDs by aggregating our own and publicly available datasets, including denovo-db and the Deciphering Developmental Disorders study data. Third, by summing the de novo CNV rates that we estimated and the previously established SNV rates, gene-based enrichment of de novo deleterious SNVs and CNVs was assessed in the 41,165 cases. Significantly enriched genes were further prioritized according to their similarity to known NDD genes using a deep learning model that considers functional characteristics (e.g., gene ontology and expression patterns).
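The third step of the Methods can be illustrated with a minimal sketch. It assumes a one-sided Poisson test of the observed de novo count against an expected count derived from the summed per-gene rates; the abstract does not name the test, so this is a common choice rather than the authors' confirmed procedure. The factor of 2 assumes the rates are per haploid genome per generation, and the function names and the rates used in the usage note are hypothetical.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    if expected <= 0:
        return 0.0
    # Start at the term for k = observed and sum the tail upward, which keeps
    # precision for very small p-values (unlike computing 1 - CDF).
    term = math.exp(-expected + observed * math.log(expected)
                    - math.lgamma(observed + 1))
    total = 0.0
    k = observed
    while term > total * 1e-16:
        total += term
        k += 1
        term *= expected / k
    return min(total, 1.0)


def expected_de_novo_count(n_cases: int, snv_rate: float, cnv_rate: float) -> float:
    """Expected deleterious de novo events in one gene across all cases.

    Assumption: both rates are per haploid genome per generation, so each
    case contributes two opportunities for a de novo event.
    """
    return 2 * n_cases * (snv_rate + cnv_rate)


def enrichment_p_value(n_observed: int, n_cases: int,
                       snv_rate: float, cnv_rate: float) -> float:
    """One-sided Poisson p-value for an excess of de novo events in a gene."""
    expected = expected_de_novo_count(n_cases, snv_rate, cnv_rate)
    return poisson_upper_tail(n_observed, expected)
```

For example, with made-up rates of 1e-6 (SNV) and 2e-7 (CNV) in 41,165 cases, the expected count is about 0.1 events, so `enrichment_p_value(5, 41165, 1e-6, 2e-7)` returns a very small p-value.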

Results: We identified a total of 380 genes achieving statistical significance (5% false discovery rate), including 31 genes affected by de novo CNVs. Of the 380 genes, 52 have not previously been reported as NDD genes, and the data of de novo CNVs contributed to the significance of three genes (GLTSCR1, MARK2, and UBR3). Among the 52 genes, we reasonably excluded 18 genes [a number almost identical to the theoretically expected false positives (i.e., 380 × 0.05 = 19)] given their constraints against deleterious variants and extracted 34 "plausible" candidate genes. Their validity as NDD genes was consistently supported by their similarity in function and gene expression patterns to known NDD genes. Quantifying the overall similarity using deep learning, we identified 11 high-confidence (> 90% true-positive probabilities) candidate genes: HDAC2, SUPT16H, HECTD4, CHD5, XPO1, GSK3B, NLGN2, ADGRB1, CTR9, BRD3, and MARK2.
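The false-positive arithmetic in the Results can be reproduced with a short sketch. It assumes that q-values come from the Benjamini-Hochberg (BH) procedure; the figure captions mention BH for other tests, but the abstract does not confirm it for the gene-level analysis. The function names are illustrative only.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def expected_false_discoveries(n_significant: int, fdr: float) -> float:
    """Expected number of false positives among genes called at a given FDR."""
    return n_significant * fdr


# Numbers taken from the abstract: 380 significant genes at a 5% FDR.
expected_fp = expected_false_discoveries(380, 0.05)  # about 19 genes
# 52 previously unreported genes minus the 18 excluded leaves 34 candidates.
plausible_candidates = 52 - 18
```

At a 5% FDR, roughly 19 of the 380 significant genes are expected to be false discoveries, which is why excluding 18 of the 52 novel genes is described as close to the theoretical expectation.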

Conclusions: We identified dozens of new candidates for NDD genes. Both the methods and the resources developed here will contribute to the further identification of novel NDD-associated genes.

Keywords: Autism spectrum disorder; Copy number variant; Copy number variation; De novo variant; Deep learning; Epileptic encephalopathy; Intellectual disability; Mutation rate; Neurodevelopmental disorder; Rare disease.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Framework for estimating mutation rates of < 1 Mb LOF CNVs per gene. a A conceptual overview showing the method for calculating the mutation rates of < 1 Mb LOF dnCNVs per gene. b A scheme depicting the method for selecting training genes. We selected training genes (here, the gene in red) that are LOF-tolerant and flanked by upstream and downstream > 1 Mb regions without any LOF-intolerant genes
Fig. 2
Contribution of dnCNVs to statistical significance of DNM enrichment analyses. a A plot of q-values of DNM enrichment analyses for each gene before (x-axis) and after (y-axis) combining dnCNV data. The gray diagonal line indicates the line of y = x. The small inset is a magnified image. The dotted lines in the small inset indicate the thresholds for exome-wide statistical significance (q-value = 0.05). b Visualization of the LOF dnCNVs affecting GLTSCR1 in a YCU case. From top to bottom, the plots show the exon–intron structures of the canonical transcripts, LOEUF, CNVs called by the exome hidden Markov model (XHMM), and z scores of depth in the XHMM analysis. LOEUF of each gene is shown as a horizontal line corresponding to its genomic region. In the plot of z score for depth, the red line indicates the z score of the case with the LOF dnCNV, and the black lines indicate the z scores of 500 randomly selected control individuals. c IGV images of WGS data of a family with a UBR3 dnCNV (13302) and a family with a MARK2 dnCNV (12103). At the top, coverage and paired-end reads of all family members and exon–intron structures of genes are shown. At the bottom, magnified images of coverage and paired-end reads of the affected proband are shown. In the magnified images, discordant read pairs, whose read one and read two surround a dnCNV, are connected with a black line, and split reads, which span a breakpoint, are connected with a red line. p1, the affected proband; fa, the father; mo, the mother; s1, the healthy sibling
Fig. 3
Spatiotemporal expression patterns of the 328 known and 34 plausible candidate genes. a Enrichment analyses of genes specifically expressed in each brain region at each developmental stage in the 328 known (the six columns of large hexagons) and 34 plausible new genes (columns of small hexagons on the right of the columns of large hexagons). Sizes of the hexagons for the 328 genes correlate with their gene set sizes. The red colors correspond to q-values of Fisher’s exact tests adjusted by the BH method. The regions of the hexagons for the 328 genes closer to the center of each hexagon correspond to genes with smaller pSI scores, namely, increasing specificity (< 0.05, < 0.01, < 0.001, and < 0.0001, respectively), while the hexagons for the 34 genes correspond to genes with pSI scores < 0.05. b Enrichment analyses of genes of each co-expression module in the 328 known (the upper row) and 34 plausible candidate genes (the lower row). The circle colors correspond to q-values of hypergeometric tests adjusted by the BH method. The circle sizes indicate the ratio of each module proportion in the 328 or 34 genes relative to that in all genes
Fig. 4
GO terms enriched in the 328 known and 34 plausible candidate genes. a Clusters of GO terms enriched (q-value < 0.01) in the 328 known and 34 plausible candidate genes. Only clusters of ten or more nodes are shown. Each node represents a GO term. Nodes are connected by an edge when the Jaccard and overlap combined coefficient for their gene members is > 0.5. Node size represents the number of gene members. Nodes are colored red when the nodes are statistically significant in the 34 plausible candidate genes. Gray ovals represent manually annotated GO groups. b Histograms of numbers of GO terms enriched (q-value < 0.01) in 34 randomly selected genes. This simulation was repeated 1000 times. In each simulation, only the 1086 terms enriched in the 328 known genes (Additional file 2: Table S13) were analyzed. Red bars indicate the number of GO terms enriched in the 34 plausible candidate genes. Empirical p-values of the enrichment in the 34 genes are the proportion of simulations with a number of GO terms equal to or more than that of the red bars. BP, GO biological process terms; CC, GO cellular component terms; MF, GO molecular function terms
Fig. 5
STRING clusters enriched in the 328 known and 34 plausible candidate genes. STRING clusters whose members are enriched (q-value < 0.01) in the proteins encoded by the 328 known and 34 candidate genes. Nodes are clustered according to the similarity of their members. Nodes are connected by an edge when the Jaccard and overlap combined coefficient for their members is > 0.375. Gray nodes: STRING clusters significantly enriched in the 328 known genes; red nodes: STRING clusters significantly enriched in the 34 candidate genes. Gray ovals: groups of nodes with similar annotations
Fig. 6
Integration of the bioinformatic analysis results using deep learning. a Scheme for the NN model. White circle: neurons of layers; line: connections between neurons. b AUC of the full NN model, the eight predictors, and the three existing gene prioritization metrics for PC3 and NC3. The blue violin plot for the NN model (“NN”) represents the distribution based on 500 full NN models, with a red dot indicating the median. c Violin plots of the full NN model scores of various gene sets. PL: the 34 plausible candidate genes. P-values of one-tailed Wilcoxon rank-sum tests are shown above. d Posterior probabilities that the 34 plausible candidate genes are true NDD-associated genes. The probabilities are the median of probabilities computed by 100 full NN models. NN model scores are shown in parentheses. Genes are arranged in the order of NN model scores. Dotted line: 90%
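Two quantities in the captions for Fig. 4 and Fig. 5 can be sketched in a few lines. The first is the "Jaccard and overlap combined coefficient" used to connect nodes; this sketch assumes an equal-weight average of the two coefficients, as in network tools such as EnrichmentMap, which the captions do not confirm. The second is the empirical p-value of Fig. 4b, defined there as the proportion of simulations with a count equal to or greater than the observed count. All names and example values are illustrative.

```python
def jaccard_coefficient(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def combined_coefficient(a: set, b: set, weight: float = 0.5) -> float:
    """Weighted mix of overlap and Jaccard (equal weights are an assumption)."""
    return weight * overlap_coefficient(a, b) + (1 - weight) * jaccard_coefficient(a, b)


def empirical_p_value(simulated_counts, observed_count) -> float:
    """Proportion of simulations with a count >= the observed count."""
    n_as_extreme = sum(1 for c in simulated_counts if c >= observed_count)
    return n_as_extreme / len(simulated_counts)


# Toy gene sets for two GO terms (made-up gene labels).
term_a = {"g1", "g2", "g3", "g4"}
term_b = {"g3", "g4", "g5", "g6"}
edge_score = combined_coefficient(term_a, term_b)

# Toy simulation results: enriched-term counts from five random gene draws.
toy_p = empirical_p_value([1, 2, 3, 4, 5], observed_count=4)
```

Under the caption's rule, two GO-term nodes would be joined by an edge when this score exceeds 0.5 (0.375 for the STRING clusters); the toy pair above falls below both thresholds.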
