[Preprint]. 2025 Jun 25:2024.09.26.615256.
doi: 10.1101/2024.09.26.615256.

Human-specific gene expansions contribute to brain evolution

Daniela C Soto et al. bioRxiv.

Update in

  • Human-specific gene expansions contribute to brain evolution.
    Soto DC, Uribe-Salazar JM, Kaya G, Valdarrago R, Sekar A, Haghani NK, Hino K, La G, Mariano NAF, Ingamells C, Baraban A, Jamal Z, Turner TN, Green ED, Simó S, Quon G, Andrés AM, Dennis MY. Cell. 2025 Sep 18;188(19):5363-5383.e22. doi: 10.1016/j.cell.2025.06.037. Epub 2025 Jul 21. PMID: 40695280

Abstract

Duplicated genes expanded in the human lineage likely contributed to brain evolution, yet challenges exist in their discovery due to sequence-assembly errors. We used a complete telomere-to-telomere genome sequence to identify 213 human-specific gene families. From these, 362 paralogs were found in all modern human genomes tested and brain transcriptomes, making them top candidates contributing to human-universal brain features. For a subset of paralogs, long-read DNA sequencing of hundreds of modern humans revealed previously hidden signatures of selection, including for the T-cell marker CD8B. To understand roles in brain development, we generated zebrafish CRISPR "knockout" models of nine orthologs and introduced mRNA encoding the paralogs, effectively "humanizing" larvae. Our findings implicate two genes in possibly contributing to hallmark features of the human brain: GPR89B in dosage-mediated brain expansion and FRMPD2B in altered synapse signaling. Our holistic approach provides insights and a comprehensive resource for studying gene expansion drivers of human brain evolution.

Keywords: brain; copy-number variation; gene duplications; gene expression; human evolution; neurodevelopment; segmental duplications; sequencing; zebrafish.

Conflict of interest statement

Declaration of Interests: The authors have nothing to disclose.

Figures

Figure 1. Genetic analysis of human-duplicated genes.
(A) Diagram of segmental duplications (SDs; blue) and subset with >98% identity (SD98; orange) in T2T-CHM13 autosomes, including total number of nucleotides (Mbp) and genes overlapping SD98 regions. (B) Copy number (CN) estimation methods, including gene-family CN (famCN) and paralog-specific CN (parCN). Horizontal lines represent short reads mapping to unique (gray) and duplicated regions (orange and yellow). Heatmaps indicate CN estimates. (C) Pipeline for clustering and stratification of SD98 genes based on synteny with the chimpanzee reference and famCN comparisons between human and nonhuman primates (NHPs) (left). CN-constrained (fixed or nearly fixed) genes were flagged based on parCN values across human populations (right). (D) UCSC Genome Browser snapshot including gene models, centromeric satellites (CenSat), SDs (SegDup), and famCN and parCN predictions across sequenced individuals. (E) Distribution of Tajima’s D values (y-axis) from 1KGP individuals of European (EUR) ancestry genome wide (gray) and SD98 (orange) across human autosomal chromosomes (x-axis). SD98 windows above the 95th (red line) or below the 5th (blue line) percentiles are considered outlier D values (STAR Methods). All human-duplicated gene names with outlier D values in at least one tested ancestry are labeled. Also see Table S1 and Figure S1.
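
As a rough illustration of the statistic plotted in panel E, the sketch below computes Tajima's D from per-site allele counts using the standard published formula and then flags windows that fall outside the 5th and 95th percentiles of a background distribution. The function names, the toy inputs, and the choice of the genome-wide values as the background are assumptions made for illustration; the authors' actual windowing and thresholds are described in their STAR Methods and are not reproduced here.

```python
import math


def tajimas_d(n, allele_counts):
    """Tajima's D for n sampled chromosomes.

    allele_counts holds, for each segregating biallelic site, the number of
    chromosomes carrying one of the two alleles (between 1 and n - 1).
    """
    s = len(allele_counts)
    if n < 4 or s == 0:
        return float("nan")  # the variance term is degenerate for tiny samples
    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


def percentile(values, q):
    """Percentile with linear interpolation between sorted values."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def flag_outliers(window_d, background_d, low=5, high=95):
    """Label windows whose D lies outside the background percentile cutoffs."""
    lo_cut = percentile(background_d, low)
    hi_cut = percentile(background_d, high)
    flagged = {}
    for name, d in window_d.items():
        if d > hi_cut:
            flagged[name] = "high"
        elif d < lo_cut:
            flagged[name] = "low"
    return flagged


# Toy data: an excess of rare variants gives a negative D.
print(tajimas_d(10, [1] * 20))
```

In this sketch, sites dominated by singletons yield a negative D and sites at intermediate frequency yield a positive D, which matches the usual reading of the statistic.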
Figure 2. Duplicated gene expression in the developing human brain.
(A) Counts of human-duplicated genes with transcripts per million (TPM) >1 in fetal brain datasets including germinal zones (VZ: ventricular zone, ISVZ: inner subventricular zone, OSVZ: outer subventricular zone, CP: cortical plate), neuronal progenitor cells (NPCs) (aRGs: apical radial glia, bRGs: basal radial glia), neuroblastoma cell line (SH-SY5Y), BrainSpan, and CORTECON. Protein-encoding genes are represented in darker shades. (B) Counts of expressed (dark orange) and non-expressed (light orange) human-duplicated genes across gene categories. (C) Human-duplicated gene expression in the CORTECON dataset stratified by copy number (CN). (D) Pipeline used for the weighted gene co-expression analysis (WGCNA). (E) The BrainSpan B-turquoise module, exhibiting an enrichment of human-duplicated genes (#) and autism-associated genes (*) plotted over developmental time (post-conception weeks, PCW) and bar colors representing brain regions (see D). Gene-ontology (GO) terms overrepresented among the co-expressed B-turquoise genes are depicted on the right. (F) Selected CORTECON WGCNA modules with enrichments (see E) and overrepresented GO terms indicated below. (G) CORTECON module assignment concordance scores are shown on the vertical axis for human-duplicated gene families. The size of each point corresponds to the number of members in the respective gene family. Also see Table S2, Figure S2, and Data S1.
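
The TPM > 1 cutoff in panel A can be illustrated with a minimal sketch of the standard TPM normalization followed by a simple threshold. The gene names, read counts, and transcript lengths below are made up for illustration, and the function names are hypothetical; this is not the authors' quantification pipeline.

```python
def tpm(read_counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    Each count is divided by transcript length, then the length-normalized
    rates are rescaled so that they sum to one million.
    """
    rates = {gene: read_counts[gene] / lengths_bp[gene] for gene in read_counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in rates}
    return {gene: 1e6 * rate / total for gene, rate in rates.items()}


def expressed_genes(tpm_values, threshold=1.0):
    """Return genes whose TPM exceeds the threshold, sorted by name."""
    return sorted(gene for gene, value in tpm_values.items() if value > threshold)


# Hypothetical toy inputs, not data from the study.
counts = {"geneA": 100, "geneB": 200, "geneC": 0}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1500}
values = tpm(counts, lengths)
print(expressed_genes(values))
```

Because the rates are divided by length before scaling, geneA and geneB end up with the same TPM here even though geneB has twice the reads.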
Figure 3. Modeling functions of duplicated genes in brain development.
Scaled TPMs from the human BrainSpan dataset, and pseudo-bulk single-cell transcriptomes from whole-brain dissected samples of mouse and zebrafish. Gene families pictured represent a subset of CN-constrained and brain-expressed human-duplicated gene families with those highlighted with black bars prioritized for additional characterization. Also see Table S3 and Figure S3.
Figure 4. Genetic variation and signatures of selection of top candidate human-duplicated genes.
(A) Number of likely gene-disruptive (LGD) (red), missense (blue), and synonymous (green) variants identified in pHSD genes. (B) Ka/Ks and (C) direction of selection (DoS) of pHSD genes with dashed lines indicating average genome-wide values between humans and chimpanzees (red) and neutrality (blue). Differences between matched ancestral and derived paralogs were tested with the Wilcoxon signed-rank test. Paralogs with infinite values or undetermined ancestral/derived state (hollow dots) were excluded from comparisons. (D) CD8B locus overview, including Tajima’s D values derived from 1KGP genome-wide SNVs (top panel). Biallelic SNVs from the Human Pangenome Reference Consortium (HPRC) and the Human Genome Structural Variation Consortium (HGSVC) assemblies are shown with a minor allele frequency greater than 0.3 in individuals of African (AFR, n=27) and American (AMR, n=18) ancestry (middle panel) and used to calculate Tajima’s D values (bottom panel). (E) Scaled transcript per million (TPM) expression of CD8B and CD8B2 in postmortem brain tissue from BrainSpan. Also see Table S4 and Figure S4.
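
For readers unfamiliar with the statistics in panels B and C, the following sketch shows the commonly used definitions: Ka/Ks as a simple ratio of already-estimated substitution rates, and DoS as Dn/(Dn+Ds) - Pn/(Pn+Ps), following the usual published definition. The function and parameter names are illustrative assumptions, estimating Ka and Ks themselves requires a codon substitution model that is not shown, and nothing here is claimed to match the authors' exact implementation.

```python
import math


def ka_ks(ka, ks):
    """Ratio of nonsynonymous (Ka) to synonymous (Ks) substitution rates.

    Returns infinity when Ks is zero but Ka is not, and NaN when both are zero.
    """
    if ks == 0:
        return math.inf if ka > 0 else math.nan
    return ka / ks


def direction_of_selection(dn, ds, pn, ps):
    """Direction of selection from divergence and polymorphism counts.

    dn, ds: nonsynonymous and synonymous fixed differences.
    pn, ps: nonsynonymous and synonymous polymorphisms.
    Returns NaN when either class has no observations.
    """
    if dn + ds == 0 or pn + ps == 0:
        return math.nan
    return dn / (dn + ds) - pn / (pn + ps)


# Hypothetical counts for illustration only.
print(direction_of_selection(dn=10, ds=10, pn=5, ps=15))
```

Under this definition a value of zero corresponds to neutrality, positive values are usually read as evidence of adaptive substitutions, and negative values as an excess of nonsynonymous polymorphism.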
Figure 5. Duplicated gene functions modeled using zebrafish.
(A) Functions of each pHSD gene were tested by generating knockout (KO, or morpholino) and ‘humanized’ models (injection of mRNA). (B) The F1 score, generated using a supervised convolutional neural network (CNN), is plotted indicating the effect size of morphological difference between models and controls, either using our batch-corrected images (blue bars) or original data (orange bars). Higher F1 score indicates greater difference. The bars for the control group indicate on average how distinct the controls are from all other groups. A threshold F1 score of 0.2 was used to define models being robustly classified as different from their control group. Pictured as a top inset are feature attribution plots with colors highlighting the region of the image used by the CNN to correctly classify and distinguish those genotypes from controls. (C) Measurements of selected pHSD gene families with heatmaps representing the percent change compared to the control group (asterisks indicating a Benjamini-Hochberg-corrected p-value<0.05). (D) t-distributed stochastic neighbor embedding (tSNE) plot highlighting classified cell types from scRNA-seq data at 3 dpf. (E) Fold-change comparison between KO and humanized models for each pHSD across all genes (n=29,945), versus their controls. Black lines represent the Pearson correlation line and the dotted lines the 95% confidence intervals. (F) Endogenous z-score scaled expression of each zebrafish ortholog across defined scRNA-seq cell types. Circle sizes scale with the overall number of cells included in that group. (G) Distribution of cell-type-specific differentially expressed genes (DEGs) for each pHSD model. Each square includes the downregulated genes in blue (lower diagonal) and upregulated genes in red (upper diagonal). Circles next to each cell type represent the number of expressed genes. (H) Gene ontology (GO) enrichment results for the top overrepresented terms in upregulated genes in forebrain and midbrain across pHSD models, with gray indicating genes with no DEGs. Significant q-value>0.05 indicated with asterisk on color legend. Also see Table S5 and Figure S5.
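
The Benjamini-Hochberg correction mentioned in panel C can be sketched as follows. This is the textbook step-up procedure for adjusted p-values written with the standard library only; the function name and the example p-values are assumptions for illustration and are not taken from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each sorted p-value is scaled by m / rank, then a running minimum taken
    from the largest rank downward keeps the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four measurements.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
print(adj, significant)
```

Comparing the adjusted values against 0.05 corresponds to controlling the false discovery rate at 5% across the set of tests.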
Figure 6. Neurodevelopmental impact of GPR89 and FRMPD2.
(A) Head and brain area assessments at 3 dpf for G0 crispants and stable knockout lines. p-values are indicated above box plots versus controls using an ANCOVA with a rank-transformation (humanized and crispant models) and Wilcoxon signed-rank tests (stable knockout lines). Representative images of each model in the neuronal transgenic line are included with scale bars representing 100 μm. (B) t-distributed stochastic neighbor embedding (tSNE) plot showing the identified subregions classified from the forebrain (n=10,040 cells) and relative scaled endogenous expression across cell types. (C) Log2 fold change (FC) of gene expression versus controls in cells from the telencephalon between knockout and humanized models. Red and blue colors correspond to DEGs discordant (GPR89) or concordant (FRMPD2) between the knockout and humanized models and their top representative gene ontology (GO) enrichment. (D) Forest plot with the results from the logistic regression for presence of progenitor versus differentiated states in forebrain cells. (E) Diagrams of the duplication event of GPR89 with different expression patterns (**Wilcoxon signed-rank test, p-value<0.005). A model of GPR89B gain-of-function in neuronal proliferation amplification is depicted on the right. (F) Behavioral results from 1 h motion-tracking evaluations in 4 dpf larvae exposed (2.5 mM) or not (0 mM) to pentylenetetrazol (PTZ) with high-speed events (HSE) defined as movement ≥28 mm/s. Colors represent the FC relative to the control group and the asterisk indicates a significant Dunn’s test (p<0.05 Benjamini-Hochberg-adjusted). (G) Diagram of the duplication event of FRMPD2 (see also E), with a model of FRMPD2B antagonistic functions resulting in altered synaptic signaling depicted on the right. Also see Table S6.
