Cell Genom. 2025 Apr 9;5(4):100807.
doi: 10.1016/j.xgen.2025.100807. Epub 2025 Mar 11.

Proteome-wide assessment of differential missense variant clustering in neurodevelopmental disorders and cancer

Jeffrey K Ng et al. Cell Genom.

Abstract

Prior studies examining genomic variants suggest that some proteins contribute to both neurodevelopmental disorders (NDDs) and cancer. While there are several potential etiologies, here, we hypothesize that missense variation in proteins occurs in different clustering patterns, resulting in distinct phenotypic outcomes. This concept was first explored in 1D protein space and expanded using 3D protein structure models. Missense de novo variants were examined from 39,883 families with NDDs and missense somatic variants from 10,543 sequenced tumors covering five The Cancer Genome Atlas (TCGA) cancer types and two Catalog of Somatic Mutations in Cancer (COSMIC) pan-cancer aggregates of tissue types. We find 18 proteins with differential missense variation clustering in NDDs compared to cancers and 19 in cancers relative to NDDs. These proteins may be important for detailed assessments in thinking of future prognostic and therapeutic applications. We establish a framework for interpreting missense patterns in NDDs and cancer, using advances in 3D protein structure prediction.

Keywords: 3D protein structure models; cancer; clustering algorithm; de novo; missense; neurodevelopmental disorders; protein; somatic; variant interpretation.
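
The idea of differential clustering in 1D protein space can be illustrated with a toy example. This is a minimal sketch and not the CLUMP or 3D-CLUMP implementation: the clustering score (mean pairwise distance between variant positions), the label-permutation test, the function names, and the residue positions are all hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def mean_pairwise_distance(positions):
    """Toy clustering score: smaller values mean tighter clustering."""
    pairs = list(combinations(positions, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def differential_clustering_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided label-permutation test on the difference in toy scores.

    Assumption: under the null, both cohorts draw variant positions from
    the same distribution, so cohort labels can be shuffled.
    """
    rng = random.Random(seed)
    observed = mean_pairwise_distance(group_a) - mean_pairwise_distance(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = (mean_pairwise_distance(pooled[:n_a])
                - mean_pairwise_distance(pooled[n_a:]))
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one convention keeps the empirical p value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical residue positions: tightly grouped in one cohort, spread in the other.
ndd_positions = [101, 103, 104, 106, 108, 110]
cancer_positions = [12, 95, 180, 260, 333, 410]
observed, p_value = differential_clustering_pvalue(ndd_positions, cancer_positions)
print(observed, p_value)
```

A negative observed difference here means the first cohort is more tightly clustered than the second. Extending the same idea to 3D would replace the sequence distance with a distance between residues in a structure model.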

Conflict of interest statement

Declaration of interests The authors declare no competing interests.

Figures

Graphical abstract
Figure 1
CLUMP and 3D-CLUMP workflow. Shown is a workflow that details the steps taken to run CLUMP and 3D-CLUMP.
Figure 2
Schematic of examples of the CLUMP and 3D-CLUMP methods. (A) Two proteins are shown: one where there is more clustering in NDDs (top) and one where there is more clustering in cancer (bottom). (B) Our AlphaFold2 prediction for NP_002065.1 (used as an example only), where variants are placed to exemplify more clustering in NDDs (left) and more clustering in cancer (right). (C) NDD data consisted of 39,883 parent-child sequenced trios (the lightning bolt is used to exemplify DNVs which, by definition, are only found in children). (D) Cancer data consisted of 10,543 individuals from the TCGA and COSMIC databases.
Figure 3
Chicago plots for the 3D-CLUMP results in the COSMIC datasets. (A) Chicago plot for 3D-CLUMP results in the NDD versus COSMIC CNS analyses. (B) Chicago plot for 3D-CLUMP results in the NDD versus COSMIC GI analyses. For both (A) and (B), proteins that were significantly clustered in NDDs but not in cancer are shown on the top above the significance line, and proteins that were significantly clustered in cancer but not in NDDs are shown on the bottom below the significance line. Proteins are placed based on the genomic coordinates of the genes that encode them, and all significant proteins are labeled on the plots. The color of the dots represents the chromosomes on which they are found. Named proteins indicate proteins that passed the protein structure filtering as well as the p value threshold. There were 39,883 families in the NDD set, 2,415 individuals in the CNS set, and 5,323 individuals in the GI set. Each individual was a distinct sample. The statistical test was 3D-CLUMP and was performed in a two-sided manner. Calculation of the p value in 3D-CLUMP is performed by generating a null distribution of values 10,000,000 times to enable testing for proteome-wide significance. A Bonferroni-corrected p value threshold of 2.895 × 10^−7 was used for proteome-wide significance.
Figure 4
Examples of proteins with proteome-wide significant differential clustering of missense variants in NDDs. Shown are proteins that were significantly clustered in NDDs but not in cancer. Shown in red are variants seen in individuals with NDDs and in blue mutations in individuals with cancer. Shown in black are those seen in both. The intensity of the color is scaled by the number of individuals with missense variants at the residue. Shown are (A) NDD versus BRCA DDX3X (NP_001350748), (B) NDD versus COAD ATP1A3 (NP_001243143), (C) NDD versus LUAD PTPN11 (NP_001317366), and (D) NDD versus GI ATPA1 (NP_689509). The colored ribbon represents the per-residue pLDDT scores, where orange is pLDDT <50, yellow is 50 < pLDDT <70, light blue is 70 < pLDDT <90, and darker blue is pLDDT >90. Please note that the ribbon color does not represent any mutation data. There were 39,883 families in the NDD set, 1,025 individuals in the BRCA set, 408 individuals in the COAD set, 569 individuals in the LUAD set, and 5,323 individuals in the GI set. The statistical test was 3D-CLUMP and was performed in a two-sided manner. Calculation of the p value in 3D-CLUMP is performed by generating a null distribution of values 10,000,000 times to enable testing for proteome-wide significance. A Bonferroni-corrected p value threshold of 2.895 × 10^−7 was used for proteome-wide significance.
Figure 5
Examples of proteins with proteome-wide significant differential clustering of missense variants in cancer. Shown are proteins that were significantly clustered in cancer but not in NDDs. Red variants are seen in individuals with NDDs and blue mutations in individuals with cancer. Numbers are shown on certain residues to indicate the number of individuals. Those in black are seen in both. The intensity of the color is scaled by the number of individuals with missense variants at the residue. Shown are (A) NDD versus PRAD SPOP (NP_003554), (B) NDD versus CNS PIK3CA (NP_006209), and (C) NDD versus GI GNAS (NP_000507). The colored ribbon represents the per-residue pLDDT scores, where orange is pLDDT <50, yellow is 50 < pLDDT <70, light blue is 70 < pLDDT <90, and darker blue is pLDDT >90. Please note that the ribbon color does not represent any mutation data. There were 39,883 families in the NDD set, 495 individuals in the PRAD set, 2,415 individuals in the CNS set, and 5,323 individuals in the GI set. The statistical test was 3D-CLUMP and was performed in a two-sided manner. Calculation of the p value in 3D-CLUMP is performed by generating a null distribution of values 10,000,000 times to enable testing for proteome-wide significance. A Bonferroni-corrected p value threshold of 2.895 × 10^−7 was used for proteome-wide significance.
Figure 6
Significant differentially clustered NDD protein network interactions. We show the interactions between the proteins that were significantly clustered in NDDs but not in cancer. The PPI enrichment was significant at a p value of 0.0229. The analysis was done using STRING-DB. Edges represent protein-protein associations. Light blue edges represent known interactions from curated databases; those in purple are experimentally determined interactions. Black lines show co-expression, and green shows that they are associated via text mining. The node colors do not have any meaning. There were 18 proteins with differential missense variation clustering in NDDs compared to cancers that were used as input to the STRING-DB program. STRING-DB provides an enrichment p value based on a hypergeometric test to compare the observed number of edges in a PPI network to the expected number.
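
The captions for Figures 3 to 5 state that the null distribution is generated 10,000,000 times to enable testing against the Bonferroni-corrected threshold of 2.895 × 10^−7. The arithmetic behind that statement can be sketched as follows. This assumes the common add-one empirical p value, (b + 1) / (N + 1); the source does not say which convention 3D-CLUMP uses, and the function names are hypothetical.

```python
import math

THRESHOLD = 2.895e-7       # Bonferroni-corrected threshold quoted in the captions
N_NULL_DRAWS = 10_000_000  # null distribution size quoted in the captions


def min_empirical_p(n_null):
    """Smallest attainable empirical p value, (0 + 1) / (n_null + 1).

    Assumption: add-one convention; not confirmed by the source.
    """
    return 1.0 / (n_null + 1)


def min_null_draws(threshold):
    """Smallest n_null whose minimum p value falls strictly below threshold."""
    # 1 / (n + 1) < threshold  <=>  n + 1 > 1 / threshold
    return math.floor(1.0 / threshold)


print(min_empirical_p(N_NULL_DRAWS) < THRESHOLD)  # resolution is fine enough
print(min_null_draws(THRESHOLD))                  # draws needed at minimum
```

Under this assumption, roughly 3.5 million null draws is the minimum at which any protein could cross the threshold, so 10,000,000 draws leaves some margin.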
