Mol Biol Evol. 2022 May 3;39(5):msac105.
doi: 10.1093/molbev/msac105.

Domain Expansion and Functional Diversification in Vertebrate Reproductive Proteins

Alberto M Rivera et al. Mol Biol Evol. 2022.

Abstract

The rapid evolution of fertilization proteins has generated remarkable diversity in molecular structure and function. Glycoproteins of vertebrate egg coats contain multiple zona pellucida (ZP)-N domains (1-6 copies) that facilitate multiple reproductive functions, including species-specific sperm recognition. In this report, we integrate phylogenetics and machine learning to investigate how ZP-N domains diversify in structure and function. The most C-terminal ZP-N domain of each paralog is associated with another domain type (ZP-C), which together form a "ZP module." All modular ZP-N domains are phylogenetically distinct from nonmodular or free ZP-N domains. Machine learning-based classification identifies eight residues that form a stabilizing network in modular ZP-N domains that is absent in free domains. Positive selection is identified in some free ZP-N domains. Our findings support that strong purifying selection has conserved an essential structural core in modular ZP-N domains, with the relaxation of this structural constraint allowing free N-terminal domains to functionally diversify.

Keywords: fertilization; gene duplication; machine learning; molecular evolution; phylogenetics; protein structure.

Figures

Fig. 1.
Phylogenetic analysis of ZP-N domain duplication history. (A) A structural alignment of mouse ZP2-N1 and ZP3-N highlights the broad structural conservation of these two classes of ZP-N domains (RMSD = ∼4.7 Å) despite only ∼18% amino acid sequence identity. The protein schematics summarize the ZP proteins included in this analysis. (B) Phylogenetic analysis (Kozlov et al. 2019) of ZP-N sequences (shown as a maximum likelihood tree) supports an ancestral separation between free and modular ZP-N domains (∼78% support). (C) A summary of ZP-N domain evolution based on the gene tree in (B). The ancestral protein contained a ZP module with C-terminal ZP-N and ZP-C domains, and duplication of the ZP-N produced the most N-terminal domain found in ZP1, ZP4, ZP2, and ZPAX. Later duplication events within ZP2 and ZPAX gave rise to multiple additional ZP-N domains between ZP-N1 and the ZP module.
Fig. 2.
Machine learning–based inference of sequence features that distinguish modular and free ZP-N domains. A logistic regression model with elastic net regularization was trained on the ZP-N multiple sequence alignment generated as part of the phylogenetic analysis. The data were partitioned for training and testing (75% and 25%, respectively), and five-fold cross-validation of the training data was used to estimate the error distribution of the score function. We defined our optimal model as the most parsimonious model (i.e., the fewest parameters) within the estimated 95% confidence interval of the unregularized model. (A) The space of regularization hyperparameters was explored during model optimization, plotted as a 3D surface (left). The score is the negative mean-squared error, and the dots correspond to the 2D cross-section shown on the right; the blue line extends the lower confidence limit of the unregularized model to its intersection with the score as a function of regularization strength. (B) Comparison of the unregularized and optimal logistic regression models as LOGO plots, with the height of each amino acid at each position corresponding to its parameter weight and colored amino acids denoting parameters retained in the regularized model (orange for modular; green for free). Each parameter weight approximates the log odds ratio for a modular domain prediction when that residue is present at that position. (C) Sequence LOGOs were constructed for individual clades within the phylogeny. They emphasize the conservation of residues within the modular ZP-N clade. There is also greater conservation of a characteristic ZP-N disulfide bond in the most N-terminal ZP-Ns compared with other free domains. (D) Mapping highly predictive sites onto ZP-N protein models suggests differences in structural properties between free and modular domains. The available crystal structure of ZP3-N (3d4c) was used and modeled as a dimer for spatial context. Modular-associated sites are generally buried along the outer edge of the homodimer.
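
The model-selection rule described in this caption (choose the model with the fewest parameters whose score stays within the confidence interval of the unregularized model) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the normal-approximation interval (z = 1.96), and every number below are assumptions made purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def lower_confidence_limit(fold_scores, z=1.96):
    """Lower limit of an approximate 95% interval for the mean fold score.

    Assumption: a normal approximation (z = 1.96) around the mean of the
    cross-validation fold scores; the paper does not state its formula here.
    """
    sem = stdev(fold_scores) / sqrt(len(fold_scores))
    return mean(fold_scores) - z * sem


def most_parsimonious(candidates, reference_fold_scores):
    """Return the candidate with the fewest nonzero parameters whose mean
    cross-validation score is at or above the reference model's lower limit."""
    limit = lower_confidence_limit(reference_fold_scores)
    eligible = [c for c in candidates if c["mean_score"] >= limit]
    if not eligible:
        return None
    # Fewest parameters first; ties broken by the better (higher) score.
    return min(eligible, key=lambda c: (c["n_params"], -c["mean_score"]))


# Hypothetical five-fold scores (negative mean-squared error) for an
# unregularized reference model.
reference = [-0.040, -0.055, -0.048, -0.060, -0.047]

# Hypothetical regularized candidates at increasing regularization strength.
candidates = [
    {"strength": 0.01, "n_params": 120, "mean_score": -0.049},
    {"strength": 0.10, "n_params": 35, "mean_score": -0.052},
    {"strength": 1.00, "n_params": 10, "mean_score": -0.055},
    {"strength": 10.0, "n_params": 2, "mean_score": -0.140},
]

best = most_parsimonious(candidates, reference)
print(best)
```

With these made-up values, the strongest regularization is rejected because its score falls below the lower limit, and the next-sparsest candidate is selected.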
Fig. 3.
Amino acid diversity and tests of positive selection in modular and free ZP-N domains. (A) A heatmap showing the within-group and between-group mean phylogenetic distances for the orthologous groups of ZP-N domains (Kumar et al. 2018). (B) Positively selected sites in mammalian ZP2-N1 and ZP2-N2 were identified through maximum likelihood analysis (Yang 2007) and mapped onto protein models (4wrn for ZP2-N1 and an AlphaFold prediction for ZP2-N2).
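
As a rough illustration of what a within-group and between-group mean distance summary involves, the Python sketch below averages pairwise distances by group membership. It assumes simple arithmetic averaging over all pairs, which may differ from the cited software's procedure; the sequence identifiers, group labels, and distance values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def group_mean_distances(distances, groups):
    """Average pairwise distances within and between groups.

    distances: dict mapping frozenset({a, b}) to the distance between a and b.
    groups: dict mapping each sequence identifier to its group label.
    Returns a dict mapping a sorted (group, group) tuple to the mean distance.
    """
    buckets = {}
    for a, b in combinations(sorted(groups), 2):
        key = tuple(sorted((groups[a], groups[b])))
        buckets.setdefault(key, []).append(distances[frozenset((a, b))])
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical sequences and pairwise distances, for illustration only.
groups = {"seq1": "free", "seq2": "free", "seq3": "modular", "seq4": "modular"}
distances = {
    frozenset(("seq1", "seq2")): 0.2,
    frozenset(("seq3", "seq4")): 0.6,
    frozenset(("seq1", "seq3")): 1.0,
    frozenset(("seq1", "seq4")): 1.2,
    frozenset(("seq2", "seq3")): 1.1,
    frozenset(("seq2", "seq4")): 1.3,
}

summary = group_mean_distances(distances, groups)
for pair, value in sorted(summary.items()):
    print(pair, round(value, 3))
```

Pairs with the same label on both sides give the within-group means (the heatmap diagonal); mixed pairs give the between-group means.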
