Mol Biol Evol. 2018 May 1;35(5):1253-1265.
doi: 10.1093/molbev/msy020.

Using Genotype Abundance to Improve Phylogenetic Inference

William S DeWitt 3rd et al. Mol Biol Evol.

Abstract

Modern biological techniques enable very dense genetic sampling of unfolding evolutionary histories, and thus frequently sample some genotypes multiple times. This motivates strategies to incorporate genotype abundance information in phylogenetic inference. In this article, we synthesize a stochastic process model with standard sequence-based phylogenetic optimality, and show that tree estimation is substantially improved by doing so. Our method is validated with extensive simulations and an experimental single-cell lineage tracing study of germinal center B cell receptor affinity maturation.

Figures

Fig. 1.
Genotype-collapsed trees. (a) A diversifying B cell lineage is illustrated with distinct BCR genotypes colored. The final observed cells (enclosed by a dashed path) consist of genotypes at various abundances; note the yellow genotype is not observed. (b) The corresponding genotype-collapsed tree (GCtree) describes the descent of distinct genotypes, and is our inferential goal. (c) Genotype abundance informs topology inference. Two hypothetical GCtrees, equally optimal with respect to the sequence data, propose two possible parents of the green genotype—the gray and yellow genotypes (the yellow genotype was not sampled and thus has a small circle with no number inside). Intuitively, the abundance information indicates that the tree on the left is preferable because the more abundant parent is more likely to have generated mutant descendants.
Fig. 2.
Modeling sequences equipped with abundances. (a) Both genotype sequence data G and genotype abundance data A inform tree topology T. As illustrated in this probabilistic graphical model, we assume independence between G and A conditioned on T rather than a fully joint model of G, A, and T. This facilitates using standard sequence-based phylogenetic optimality for G, augmented with a branching process (with parameters θ) for A. (b) For the binary infinite-type Galton–Watson process, θ=(p,q). Four possible branching events characterize the offspring distribution common to all nodes. A node may bifurcate (with probability p) or terminate, and upon bifurcating its descendants each may be a mutant (with probability q). (c) A GCtree node specifies a genotype’s clonal leaf count and number of descendant genotypes, but not lineage details. The likelihood of a GCtree node marginalizes over consistent lineage branching outcomes. (d) GCtree likelihood factorizes into the product of likelihoods for each genotype.
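The branching events in panel (b) can be made concrete with a short sketch. It is based only on the caption's description: a node terminates with probability 1 − p or bifurcates with probability p, and each of the two descendants is independently a mutant with probability q. The function names `offspring_distribution` and `simulate_genotype` and the `max_nodes` safety cap are hypothetical, chosen for illustration; this is not the GCtree implementation.

```python
import random


def offspring_distribution(p, q):
    """Probabilities of the four branching events, keyed by
    (number of clonal children, number of mutant children)."""
    return {
        (0, 0): 1 - p,                # node terminates and becomes a leaf
        (2, 0): p * (1 - q) ** 2,     # bifurcates, both children clonal
        (1, 1): 2 * p * q * (1 - q),  # bifurcates, exactly one child is a mutant
        (0, 2): p * q ** 2,           # bifurcates, both children are mutants
    }


def simulate_genotype(p, q, rng, max_nodes=100_000):
    """Simulate the lineage of a single genotype.

    Returns (clonal leaf count, number of mutant offspring), i.e. the two
    quantities a GCtree node records. Mutant offspring found new genotypes
    and are only counted here, not simulated further.
    """
    active = 1      # clonal nodes still waiting to branch or terminate
    leaves = 0      # terminated clonal nodes (the genotype's abundance)
    mutants = 0     # mutant offspring (descendant genotypes)
    processed = 0
    while active:
        processed += 1
        if processed > max_nodes:  # assumed guard against supercritical runs
            raise RuntimeError("lineage did not terminate within max_nodes")
        active -= 1
        if rng.random() >= p:      # terminate with probability 1 - p
            leaves += 1
            continue
        for _ in range(2):         # bifurcate: two descendants
            if rng.random() < q:
                mutants += 1
            else:
                active += 1
    return leaves, mutants


rng = random.Random(0)
print(offspring_distribution(0.4, 0.3))
print(simulate_genotype(0.4, 0.3, rng))
```

With p below 0.5 the clonal sub-process dies out with probability 1, so the simulation terminates; the cap only guards against large p.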
Fig. 3.
In silico validation of GCtree inference. (a) Demonstrating the simulation–inference–validation workflow, a small simulation resulted in two equally maximally parsimonious trees, and the one inferred using GCtree was correct. The initial sequence was a naive BCR V gene from the experimental data described in Materials and Methods. Branch lengths in the cell lineage tree (left) correspond to simulation time steps, while those in collapsed trees correspond to sequence edit distance. (b) One hundred simulations were performed with parameters calibrated using the BCR sequencing data and summary statistics described in Materials and Methods. Of these 100 simulations, 66 resulted in parsimony degeneracy, with an average degeneracy of 12 and a maximum degeneracy of 124. For each of these 66, we show the distribution of Robinson–Foulds (RF) distance of trees in the parsimony forest to the true tree. “RF” denotes a modified Robinson–Foulds distance: since nonzero abundance internal nodes in GCtrees represent observed taxa, RF distance was computed as if all such nodes had an additional descendant leaf representing that taxon. GCtree MLEs (red) tend to be better reconstructions of the true tree than other parsimony trees (gray boxes). Four simulations resulted in a tie for the GCtree MLE, and the two tied trees in these cases are both displayed in red. Aggregated data across all simulations are depicted on the right, clearly indicating superior reconstructions from GCtree.
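The modified RF distance described in panel (b) can be illustrated with a minimal sketch. It assumes a rooted, cluster-based form of the Robinson–Foulds distance (the number of clades present in one tree but not the other); the caption does not say which RF variant the authors computed. The tuple-based tree representation and the names `clusters` and `modified_rf` are hypothetical. The key step is that an internal node with nonzero abundance contributes its own genotype to the taxon set below it, as if it had an extra descendant leaf.

```python
def clusters(node, out):
    """Collect the taxon set under every internal node of a GCtree.

    A node is a tuple (name, abundance, children). Returns the set of
    observed taxa at or below `node` and adds one frozenset per internal
    node to `out`.
    """
    name, abundance, children = node
    taxa = set()
    if abundance > 0:
        # Observed genotype: either a true leaf, or an internal node that is
        # treated as if it had an additional descendant leaf for its taxon.
        taxa.add(name)
    for child in children:
        taxa |= clusters(child, out)
    if children:
        out.add(frozenset(taxa))
    return taxa


def modified_rf(tree_a, tree_b):
    """Size of the symmetric difference of the two trees' clade sets."""
    clades_a, clades_b = set(), set()
    clusters(tree_a, clades_a)
    clusters(tree_b, clades_b)
    return len(clades_a ^ clades_b)


# Two toy GCtrees over the same observed genotypes (made-up abundances).
# In tree_a the observed gray genotype is the parent of green; in tree_b an
# unobserved ancestor (abundance 0) is the parent of green and blue.
tree_a = ("naive", 1, [("gray", 3, [("green", 2, [])]), ("blue", 1, [])])
tree_b = ("naive", 1, [("gray", 3, []),
                       ("unobserved", 0, [("green", 2, []), ("blue", 1, [])])])

print(modified_rf(tree_a, tree_a))  # 0
print(modified_rf(tree_a, tree_b))  # 2
```

Without the extra-leaf treatment, the gray node in `tree_a` would define only the trivial clade {green}, and the comparison would ignore that gray is itself a sampled taxon.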
Fig. 4.
Empirical validation using lineage tracing and single-cell germinal center BCR sequencing. (a) A multiphoton image of a germinal center reveals a dominant blue lineage (scale bar 100 µm). This lineage was sorted, and 48 cells sequenced to determine IgH and IgL genotypes of each. These sequences were analyzed with partis (Ralph and Matsen 2016a, 2016b) to infer naive (preaffinity-maturation) ancestor sequences using germline genetic information, and trees were inferred with GCtree. (b) GCtree inference was performed separately for IgH and IgL loci, resulting in parsimony degeneracies of 13 and 9, respectively. Maximum likelihood GCtrees for each locus are indicated in red and the GCtrees with annotated abundance are shown. Roots are labeled with the gene annotations of the naive state inferred using partis. Small unnumbered nodes indicate inferred unobserved ancestral genotypes. Numbered edges indicate support in 100 bootstrap samples. (c) All possible pairings of IgH and IgL parsimony trees were compared in terms of the Robinson–Foulds distance between the IgH and IgL trees, labeled by cell identity. IgH and IgL parsimony trees are ordered by GCtree likelihood rank in columns and rows, respectively. Grid values show RF distance between each IgH/IgL pair. MLE trees result in more consistent cell lineage reconstructions between IgH and IgL (smaller RF values). (d) For each locus, distributions of bootstrap support values are shown for the tree inferred by GCtree and for a majority rule consensus tree of all trees in the parsimony forest. The latter contain more partitions with very low support. (e) Using additional data from a second germinal center from the same lymph node that had the same naive BCR sequence, GCtree correctly resolves the two germinal centers as distinct clades (as did other lower ranked parsimony trees).

References

    1. Barak M, Zuckerman N, Edelman H, Unger R, Mehr R. 2008. IgTree (c): creating immunoglobulin variable region gene lineage trees. J Immunol Methods 338(1–2):67–74.
    2. Bertoin J. 2009. The structure of the allelic partition of the total population for Galton–Watson processes with neutral mutations. Ann Probab. 37(4):1502–1523.
    3. Brodin J, Hedskog C, Heddini A, Benard E, Neher RA, Mild M, Albert J. 2015. Challenges with using primer IDs to improve accuracy of next generation sequencing. PLoS One 10(3):e0119123.
    4. Cusanovich DA, Daza R, Adey A, Pliner HA, Christiansen L, Gunderson KL, Steemers FJ, Trapnell C, Shendure J. 2015. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348(6237):910–914.
    5. DeWitt WS, Lindau P, Snyder TM, Sherwood AM, Vignali M, Carlson CS, Greenberg PD, Duerkopp N, Emerson RO, Robins HS. 2016. A public database of memory and naive B-cell receptor sequences. PLoS One 11(8):e0160853.