phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets
- PMID: 35486906
- PMCID: PMC9094560
- DOI: 10.1371/journal.pcbi.1010056
Abstract
Sequence simulators are fundamental tools in bioinformatics: they allow us to test data processing and inference tools, and they are an essential component of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which is beyond the processing power of many methods; simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. >100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach and implements an efficient multi-layered search tree structure. It achieves high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate on each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages and supports a variety of evolutionary models, including indel models and new hypermutability models that we developed to represent SARS-CoV-2 genome evolution more realistically.
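To make the idea of a Gillespie simulation backed by a search tree concrete, the following is a minimal sketch and not phastSim's actual implementation. It assumes a single binary sum tree over per-site mutation rates (the abstract describes a multi-layered structure whose details are not given here), and it holds the rates fixed along the branch. The names `RateTree` and `simulate_branch` are hypothetical and chosen only for illustration.

```python
import random


class RateTree:
    """Binary sum tree over per-site mutation rates.

    Hypothetical helper for illustration; not phastSim's data structure.
    Leaves hold per-site rates, internal nodes hold the sum of their children.
    """

    def __init__(self, rates):
        rates = [float(r) for r in rates]
        self.n = len(rates)
        self.size = 1
        while self.size < self.n:
            self.size *= 2  # pad the leaf layer to a power of two
        self.tree = [0.0] * (2 * self.size)
        self.tree[self.size:self.size + self.n] = rates
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):
        """Total mutation rate over all sites (stored at the root)."""
        return self.tree[1]

    def update(self, site, rate):
        """Change one site's rate and refresh the sums on its path to the root."""
        i = self.size + site
        self.tree[i] = float(rate)
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, u):
        """Return the site whose cumulative-rate interval contains u, 0 <= u < total."""
        i = 1
        while i < self.size:
            left = self.tree[2 * i]
            if u < left:
                i = 2 * i
            else:
                u -= left
                i = 2 * i + 1
        # Guard against floating-point drift landing on a padding leaf.
        return min(i - self.size, self.n - 1)


def simulate_branch(tree, branch_length, rng):
    """Gillespie simulation along one branch.

    Draws exponential waiting times at the total rate and picks the mutating
    site in O(log n) with the sum tree. Returns a list of (time, site) events.
    Assumption: rates do not change after a mutation; a state-dependent model
    would call tree.update(site, new_rate) after each event.
    """
    events = []
    t = 0.0
    while True:
        total = tree.total()
        if total <= 0.0:
            break
        t += rng.expovariate(total)
        if t > branch_length:
            break
        events.append((t, tree.sample(rng.random() * total)))
    return events


if __name__ == "__main__":
    rng = random.Random(42)
    tree = RateTree([1.0] * 30000)  # roughly a SARS-CoV-2-sized genome
    print(simulate_branch(tree, 1e-4, rng))
```

With short branches the expected number of events per branch is small (about 3 in the usage example above), so the work per branch scales with the number of mutations times the logarithm of the genome length rather than with the genome length itself.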
Conflict of interest statement
The authors have declared that no competing interests exist.
Update of
- phastSim: efficient simulation of sequence evolution for pandemic-scale datasets. bioRxiv [Preprint]. 2021 Sep 23:2021.03.15.435416. doi: 10.1101/2021.03.15.435416. Update in: PLoS Comput Biol. 2022 Apr 29;18(4):e1010056. doi: 10.1371/journal.pcbi.1010056. PMID: 33758852.