. 2022 Mar 3;220(3):iyab229.

doi: 10.1093/genetics/iyab229.

Efficient ancestry and mutation simulation with msprime 1.0

Franz Baumdicker¹, Gertjan Bisschop², Daniel Goldstein^{3

4}, Graham Gower⁵, Aaron P Ragsdale⁶, Georgia Tsambos⁷, Sha Zhu⁸, Bjarki Eldon⁹, E Castedo Ellerman¹⁰, Jared G Galloway^{11

12}, Ariella L Gladstein^{13

14}, Gregor Gorjanc¹⁵, Bing Guo¹⁶, Ben Jeffery⁸, Warren W Kretzschumar¹⁷, Konrad Lohse², Michael Matschiner¹⁸, Dominic Nelson¹⁹, Nathaniel S Pope²⁰, Consuelo D Quinto-Cortés²¹, Murillo F Rodrigues¹¹, Kumar Saunack²², Thibaut Sellinger²³, Kevin Thornton²⁴, Hugo van Kemenade, Anthony W Wohns^{8

4}, Yan Wong⁸, Simon Gravel¹⁹, Andrew D Kern¹¹, Jere Koskela²⁵, Peter L Ralph^{11

26}, Jerome Kelleher⁸

Affiliations

¹ Cluster of Excellence "Controlling Microbes to Fight Infections", Mathematical and Computational Population Genetics, University of Tübingen, 72076 Tübingen, Germany.
² Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK.
³ Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA.
⁴ Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
⁵ Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark.
⁶ Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI 53706, USA.
⁷ Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia.
⁸ Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK.
⁹ Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde, Berlin 10115, Germany.
¹⁰ Fresh Pond Research Institute, Cambridge, MA 02140, USA.
¹¹ Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA.
¹² Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA.
¹³ Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7264, USA.
¹⁴ Embark Veterinary, Inc., Boston, MA 02111, USA.
¹⁵ The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK.
¹⁶ Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA.
¹⁷ Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden.
¹⁸ Natural History Museum, University of Oslo, 0318 Oslo, Norway.
¹⁹ Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada.
²⁰ Department of Entomology, Pennsylvania State University, State College, PA 16802, USA.
²¹ National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico.
²² IIT Bombay, Powai, Mumbai 400 076, India.
²³ Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany.
²⁴ Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA.
²⁵ Department of Statistics, University of Warwick, Coventry CV4 7AL, UK.
²⁶ Department of Mathematics, University of Oregon, Eugene, OR 97403-5289, USA.

PMID: 34897427
PMCID: PMC9176297
DOI: 10.1093/genetics/iyab229

Efficient ancestry and mutation simulation with msprime 1.0

Franz Baumdicker et al. Genetics. 2022.

. 2022 Mar 3;220(3):iyab229.

doi: 10.1093/genetics/iyab229.

Authors

Affiliations

¹ Cluster of Excellence "Controlling Microbes to Fight Infections", Mathematical and Computational Population Genetics, University of Tübingen, 72076 Tübingen, Germany.
² Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK.
³ Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA.
⁴ Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
⁵ Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark.
⁶ Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI 53706, USA.
⁷ Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia.
⁸ Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK.
⁹ Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde, Berlin 10115, Germany.
¹⁰ Fresh Pond Research Institute, Cambridge, MA 02140, USA.
¹¹ Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA.
¹² Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA.
¹³ Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7264, USA.
¹⁴ Embark Veterinary, Inc., Boston, MA 02111, USA.
¹⁵ The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK.
¹⁶ Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA.
¹⁷ Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden.
¹⁸ Natural History Museum, University of Oslo, 0318 Oslo, Norway.
¹⁹ Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada.
²⁰ Department of Entomology, Pennsylvania State University, State College, PA 16802, USA.
²¹ National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico.
²² IIT Bombay, Powai, Mumbai 400 076, India.
²³ Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany.
²⁴ Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA.
²⁵ Department of Statistics, University of Warwick, Coventry CV4 7AL, UK.
²⁶ Department of Mathematics, University of Oregon, Eugene, OR 97403-5289, USA.

PMID: 34897427
PMCID: PMC9176297
DOI: 10.1093/genetics/iyab229

Abstract

Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.

Keywords: Ancestral Recombination Graphs; coalescent; mutations; simulation.

PubMed Disclaimer

Figures

**Figure 1**
Visualization of the separation between ancestry and mutation simulation. (A) The result of an invocation of sim_ancestry is two trees along a 1 kb chunk of genome relating three diploid samples. Each diploid individual consists of two genomes (or nodes), indicated by color. (B) This ancestry is provided as the input to sim_mutations, which adds mutations. Graphics produced using tskit’s draw_svg method.

**Figure 2**
An example tree sequence describing genealogies and sequence variation for four samples at 10 sites on a chromosome of 20 bases long. Information is stored in a set of tables (the tables shown here include only essential columns, and much more information can be associated with the various entities). The node table stores information about sampled and ancestral genomes. The edge table describes how these genomes are related along a chromosome, and defines the genealogical tree at each position. The site and mutation tables together describe sequence variation among the samples. The genotype matrix and tree topologies shown on the left are derived from these tables.

**Figure 3**
Time required to run sim_mutations on tree sequences generated by sim_ancestry (with a population size of 10⁴ and recombination rate of $10^{- 8}$ ) for varying (haploid) sample size and sequence length. We ran 10 replicate mutation simulations each for three different mutation rates, and report the average CPU time required (Intel Core i7-9700). (A) Holding sequence length fixed at 10 megabases and varying the number of samples (tree tips) from 10 to 100,000. (B) Holding number of samples fixed at 1000, and varying the sequence length from 1 to 100 megabases.

**Figure 4**
Comparison of simulation performance using msprime (sim_ancestry), SimBac, and fastSimBac for varying (haploid) sample sizes, and the current estimates for *E. coli* parameters (Lapierre *et al.* 2016): a 4.6 Mb genome, $N_{e} = 1.8 \times 10^{8}$ , gene conversion rate of $8.9 \times 10^{- 11}$ per base and mean tract length of 542. We report (A) the total CPU time and (B) maximum memory usage averaged over five replicates (Intel Xeon E5-2680 CPU). We did not run SimBac beyond first two data points because of the very long running times.

**Figure 5**
(A) A simple ARG in which a recombination occurs at position 0.3; (B) the equivalent topology depicted as a tree sequence, including the recombination node; (C) the same tree sequence topology “simplified” down to its minimal tree sequence representation. Note the original node IDs have been retained for clarity.

**Figure 6**
Comparison of selective sweep simulation performance in msprime (sim_ancestry) and discoal (Intel Xeon Gold 6148 CPU). We report the average CPU time and maximum memory usage when simulating three replicates for 100 diploid samples in a model with a single selective sweep in its history, where the beneficial allele had a selection coefficient of s = 0.05, a per-base recombination rate of $10^{- 8}$ , population size of $N = 10^{4}$ , and sequence length varying from 100 kb–3000 kb.

**Figure 7**
Comparison of DTWF simulation performance in msprime (sim_ancestry) and ARGON (Intel Xeon E5-2680 CPU). We ran simulations with a population size of 10⁴ and recombination rate of $10^{- 8}$ , with 500 diploid samples and varying sequence length. We report (A) total CPU time and (B) maximum memory usage; each point is the average over five replicate simulations. We show observations for ARGON, msprime’s DTWF implementation (“DTWF”) and a hybrid simulation of 100 generations of the DTWF followed by the standard coalescent with recombination (“Hybrid”). We ran ARGON with a mutation rate of 0 and with minimum output options, with a goal of measuring only ancestry simulation time. Memory usage for msprime’s DTWF and hybrid simulations are very similar.

**Figure 8**
Running time of sim_ancestry for (A) small and (B) larger simulations on an Intel i7-6600U CPU. Each point is the run time of one simulation, for various values of effective population size (*N_e*), chromosome length in Morgans (L), and number of diploid samples (n). Run time scales quadratically with the product of *N_e* and L, shown on the horizontal axis. For example, (A) shows that 1000 samples of 1 Morgan-length chromosomes from a population of $N_{e} = 2, 000$ diploids would take about 2 s, and (equivalently) that the same number of 0.01 Morgan segments with $N_{e} = 200, 000$ would take the same time. Since recombination rate in these simulations was $10^{- 8}$ , L is the number of base pairs divided by 10⁸. The black lines are quadratic fits separately in each panel and sample size. Vertical gray lines show the approximate values of $N_{e} L$ for chromosome 1 in three species, using values from the stdpopsim catalog (Adrion *et al.* 2020a).

See this image and copyright information in PMC

References

1. Adrion JR, Cole CB, Dukler N, Galloway JG, Gladstein AL, et al. 2020a. A community-maintained standard library of population genetic models. Elife. 9:e54967. - PMC - PubMed
1. Adrion JR, Galloway JG, Kern AD.. 2020b. Predicting the landscape of recombination using deep learning. Mol Biol Evol. 37:1790–1808. - PMC - PubMed
1. Arenas M. 2012. Simulation of molecular data under diverse evolutionary scenarios. PLoS Comput Biol. 8:e1002495. - PMC - PubMed
1. Arenas M, Posada D.. 2007. Recodon: coalescent simulation of coding DNA sequences with recombination, migration and demography. BMC Bioinformatics. 8:458. - PMC - PubMed
1. Árnason E. 2004. Mitochondrial cytochrome b DNA variation in the high-fecundity Atlantic cod: trans-Atlantic clines and shallow gene genealogy. Genetics. 166:1871–1885. - PMC - PubMed

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
Other Literature Sources
- H1 Connect - Access expert opinions and insights on biomedical research.

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Efficient ancestry and mutation simulation with msprime 1.0

Affiliations

Efficient ancestry and mutation simulation with msprime 1.0

Authors

Affiliations

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources