Bioinformatics. 2022 Jun 24;38(Suppl 1):i134-i142.
doi: 10.1093/bioinformatics/btac242.

Simulating domain architecture evolution

Xiaoyue Cui et al. Bioinformatics.

Abstract

Motivation: Simulation is an essential technique for generating biomolecular data with a 'known' history for use in validating phylogenetic inference and other evolutionary methods. On longer time scales, simulation supports investigations of equilibrium behavior and provides a formal framework for testing competing evolutionary hypotheses. Twenty years of molecular evolution research have produced a rich repertoire of simulation methods. However, current models do not capture the stringent constraints acting on the domain insertions, duplications, and deletions by which multidomain architectures evolve. Although these processes have the potential to generate any combination of domains, only a tiny fraction of possible domain combinations are observed in nature. Modeling these stringent constraints on domain order and co-occurrence is a fundamental challenge in domain architecture simulation that does not arise with sequence and gene family simulation.

Results: Here, we introduce a stochastic model of domain architecture evolution to simulate evolutionary trajectories that reflect the constraints on domain order and co-occurrence observed in nature. This framework is implemented in a novel domain architecture simulator, DomArchov, using the Metropolis-Hastings algorithm with data-driven transition probabilities. The use of a data-driven event module enables quick and easy redeployment of the simulator for use in different taxonomic and protein function contexts. Using empirical evaluation with metazoan datasets, we demonstrate that domain architectures simulated by DomArchov recapitulate properties of genuine domain architectures that reflect the constraints on domain order and adjacency seen in nature. This work expands the realm of evolutionary processes that are amenable to simulation.
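
The Metropolis-Hastings idea named above can be illustrated with a minimal sketch. This is not DomArchov's implementation: the target distribution (a smoothed bigram model over adjacent domains), the uniform gain/loss proposal, the toy architectures, and all function names are assumptions made for illustration only. Gains insert a uniformly chosen domain at a uniformly chosen position and losses delete a uniformly chosen position, so the Hastings correction reduces to the vocabulary size for a gain and its reciprocal for a loss.

```python
import math
import random
from collections import Counter

START, END = "<start>", "<end>"  # hypothetical boundary markers


def fit_bigram_log_probs(architectures, pseudocount=0.1):
    """Estimate smoothed log P(next | previous) from observed architectures."""
    vocab = sorted({d for arch in architectures for d in arch})
    counts = Counter()
    for arch in architectures:
        path = [START, *arch, END]
        counts.update(zip(path, path[1:]))
    log_p = {}
    for prev in [START, *vocab]:
        successors = [*vocab, END]
        total = sum(counts[(prev, nxt)] + pseudocount for nxt in successors)
        for nxt in successors:
            log_p[(prev, nxt)] = math.log((counts[(prev, nxt)] + pseudocount) / total)
    return vocab, log_p


def log_target(arch, log_p):
    """Unnormalized log target; empty architectures get probability zero."""
    if not arch:
        return -math.inf
    path = [START, *arch, END]
    return sum(log_p[pair] for pair in zip(path, path[1:]))


def mh_step(arch, vocab, log_p, rng):
    """Propose one domain gain or loss and accept or reject it."""
    if rng.random() < 0.5:  # gain: insert a random domain at a random position
        pos = rng.randrange(len(arch) + 1)
        proposal = arch[:pos] + (rng.choice(vocab),) + arch[pos:]
        log_hastings = math.log(len(vocab))
    else:  # loss: delete the domain at a random position
        pos = rng.randrange(len(arch))
        proposal = arch[:pos] + arch[pos + 1:]
        log_hastings = -math.log(len(vocab))
    log_alpha = log_target(proposal, log_p) - log_target(arch, log_p) + log_hastings
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal, True
    return arch, False


def simulate(start, vocab, log_p, n_steps, seed=0):
    """Run the chain; return the final architecture and the acceptance rate."""
    rng = random.Random(seed)
    arch, accepted = tuple(start), 0
    for _ in range(n_steps):
        arch, ok = mh_step(arch, vocab, log_p, rng)
        accepted += ok
    return arch, accepted / n_steps


# Toy input, invented for illustration (not data from the paper).
observed = [("SH3", "SH2", "PK_Tyr"), ("SH2", "PK_Tyr"), ("SH3", "SH3", "SH2"), ("PK_Tyr",)]
vocab, log_p = fit_bigram_log_probs(observed)
final_arch, acceptance_rate = simulate(("SH2",), vocab, log_p, n_steps=5000, seed=1)
print(final_arch, round(acceptance_rate, 3))
```

Swapping in a different set of observed architectures changes the fitted probabilities without touching the sampler, which is the kind of redeployment a data-driven event module is meant to allow.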

Availability and implementation: DomArchov is written in Python 3 and is available at http://www.cs.cmu.edu/~durand/DomArchov. The data underlying this article are available via the same link.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Example multidomain protein: proto-oncogene tyrosine-protein kinase Src in human. (a) Domains in the sequence are identified by PFAM (Mistry et al., 2021) HMMs: SH3 (PF00018), SH2 (PF00017), and a protein tyrosine kinase (PF07714). Sequence in linker regions represented as (). (b) The 3D structure of Src, with the SH2, SH3, and kinase folds shown in purple, red, and blue. (c) Src domain architecture, showing its constituent domains in N- to C-terminal order. (d) A sequence LOGO for the PFAM SH3 domain model.
Fig. 2.
Schematic showing changes in domain architecture via insertion, duplication, and deletion of domains.
Fig. 3.
The state transition diagram showing states adjacent to a DA of length n = 3. Each stack of circles on the right represents the ND states that can be reached by a domain gain at the associated position.
Fig. 4.
MCMC convergence assessment. (a) Gelman-Rubin diagnostic applied to DA lengths sampled every 100 iterations with the primate dataset. (b) Event acceptance rate.
Fig. 5.
Final DA length as a function of chain length (horizontal axis not to scale). Top panels: Close-up view of the same distribution. Mean DA length shown as solid dots. Horizontal line represents mean length of genuine DAs. Length distributions of genuine DAs are plotted in the rightmost columns. (a) Primates. (b) Fish. (c) Drosophila. (d) Cnidaria.
Fig. 6.
Frequency of accepted gain (left axis) and loss (right axis) positions; bars for gains (blue) and losses (red) are interleaved, starting with gains.
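
The Gelman-Rubin diagnostic mentioned in the Fig. 4 caption compares between-chain and within-chain variance of a sampled quantity, here DA length. The sketch below computes the basic (non-split) potential scale reduction factor for several equal-length chains. Which variant the authors used is not stated here, so the formula choice, the function name, and the toy numbers are assumptions.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic potential scale reduction factor (R-hat) for equal-length chains."""
    m, n = len(chains), len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)  # W: mean within-chain variance
    between = n * variance(chain_means)         # B: between-chain variance
    if within == 0:
        raise ValueError("within-chain variance is zero; R-hat is undefined")
    pooled = (n - 1) / n * within + between / n  # pooled variance estimate
    return math.sqrt(pooled / within)


# Toy DA-length samples, invented for illustration (e.g. after thinning a run).
mixed = [[3.0, 4.0, 3.0, 5.0], [4.0, 3.0, 5.0, 3.0]]
stuck = [[1.0, 2.0, 1.0, 2.0], [9.0, 8.0, 9.0, 8.0]]
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values close to 1 suggest the chains are sampling the same distribution, while values well above 1 indicate that they have not yet mixed.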
