BMC Bioinformatics. 2010 Oct 30:11:538.
doi: 10.1186/1471-2105-11-538.

pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree


Frederick A Matsen et al. BMC Bioinformatics. 2010.

Abstract

Background: Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power offered by likelihood-based approaches to large data sets.

Results: This paper introduces pplacer, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. Pplacer features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence.
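The throughput figure above lends itself to a back-of-the-envelope runtime estimate. The sketch below is not part of pplacer; it assumes that the quoted rate (twenty thousand reads per hour per processor at one thousand reference taxa) scales exactly linearly with the number of taxa and that work divides perfectly across processors. The function name and parameters are hypothetical, and real runtimes also depend on read length, alignment width, and hardware.

```python
def estimated_hours(n_reads, n_taxa, n_processors=1,
                    reads_per_hour=20_000, baseline_taxa=1_000):
    """Rough wall-clock estimate for a placement run.

    Assumptions (not guarantees): cost per read grows linearly with the
    number of reference taxa, and reads are split evenly over processors.
    """
    if n_reads < 0 or n_taxa <= 0 or n_processors < 1:
        raise ValueError("n_reads >= 0, n_taxa > 0, n_processors >= 1 required")
    # Scale the quoted per-processor rate to the requested tree size.
    per_processor_rate = reads_per_hour * baseline_taxa / n_taxa
    return n_reads / (per_processor_rate * n_processors)


# One million reads, 2,000 reference taxa, 8 processors.
print(estimated_hours(1_000_000, 2_000, n_processors=8))
```

Because each read is placed independently of the others, splitting the query file into chunks is the natural way to parallelize, which is what the even-split assumption models.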

Conclusions: Pplacer enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is freely available as source code, binaries, and a web service.


Figures

Figure 1
Example application, showing uncertainty. Pplacer example application using psbA reference sequences and the corresponding recruited Global Ocean Sampling [4] (GOS) sequences showing both number of placements and their uncertainty. Branch thickness is a linear function of the log-transformed number of placements on that edge, and branch color represents average uncertainty (more red implies more uncertain, with yellow denoting EDPL above a user-defined limit). The upper panel shows the Prochlorococcus clade of the tree. The lower panel shows a portion of the tree with substantial uncertainty using the EDPL metric. Placeviz output viewed using Archaeopteryx [32].
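The thickness rule in this caption can be illustrated with a short sketch. It is not placeviz's actual code: the function name, the `min_width` and `scale` parameters, and the use of `log(1 + n)` (chosen here so that an edge with zero placements still gets a finite width) are all assumptions made for illustration.

```python
import math


def branch_width(n_placements, min_width=1.0, scale=2.0):
    """Width as a linear function of the log-transformed placement count.

    log(1 + n) is an assumed transform; it keeps the width finite at n = 0.
    """
    if n_placements < 0:
        raise ValueError("n_placements must be non-negative")
    return min_width + scale * math.log(1 + n_placements)


for n in (0, 1, 10, 100, 1000):
    print(n, round(branch_width(n), 2))
```

The log transform keeps heavily populated edges from visually swamping edges that hold only a few placements.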
Figure 2
Linear time dependence on number of reference taxa. Time to place 10,000 16S rRNA reads of median length 198 nt onto a reference phylogenetic tree, with a 1287 nt reference alignment. Tests run on an Intel Xeon @ 2.33 GHz.
Figure 3
pplacer memory requirements. Memory required to place 10,000 16S rRNA reads of median length 198 nt onto a reference phylogenetic tree, with a 1287 nt reference alignment. Tests run on an Intel Xeon @ 2.33 GHz.
Figure 4
Measuring uncertainty by the expected distance between placement locations (EDPL). The Expected Distance between Placement Locations (EDPL) uncertainty metric can indicate if placement uncertainty may pose a problem for downstream analysis. The EDPL uncertainty is the sum of the distances between the optimal placements weighted by their probability (4). The hollow stars on the left side of the tree depict a case where there is considerable uncertainty as to the exact placement edge, but the collection of possible edges all sit in a small region of the tree. This local uncertainty would have a low EDPL score. The full stars on the right side of the diagram would have a large EDPL, as the different placements are spread widely across the tree. Such a situation can be flagged for special treatment or removal.
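One reading of the EDPL description is the expected tree distance between two placement locations drawn independently according to their probabilities. The sketch below follows that reading. It assumes the pairwise distances between candidate placement locations have already been computed, and it treats division by total tree length as an optional normalization; the function name and arguments are hypothetical and do not reflect pplacer's interface.

```python
def edpl(probs, dist, tree_length=None):
    """Expected distance between placement locations.

    probs       -- probability (or likelihood weight ratio) of each location
    dist        -- symmetric matrix of tree distances between locations
    tree_length -- optional total branch length used to normalize (assumed)
    """
    n = len(probs)
    if any(len(row) != n for row in dist) or len(dist) != n:
        raise ValueError("dist must be an n x n matrix matching probs")
    # Sum over ordered pairs: the expectation for two independent draws.
    total = sum(probs[i] * probs[j] * dist[i][j]
                for i in range(n) for j in range(n))
    if tree_length is None:
        return total
    return total / tree_length


# Two equally likely locations that sit close together (hollow stars) ...
local = edpl([0.5, 0.5], [[0.0, 0.02], [0.02, 0.0]])
# ... versus two equally likely locations far apart (full stars).
spread = edpl([0.5, 0.5], [[0.0, 1.5], [1.5, 0.0]])
print(local, spread)
```

Both cases have the same edge-level uncertainty (two candidates at probability 0.5 each), yet the second scores far higher, which is the distinction the caption draws between local and widely spread uncertainty.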
Figure 5
Example application. Placement visualization of same results as in Figure 1. The notation "15_at_4", for example, means that 15 sequences were placed at internal edge number 4. These edge numbers can then be used to find the corresponding sequences in the .loc.fasta file. Placeviz output viewed using FigTree [75].
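Labels of the form "15_at_4" are easy to split back into a count and an edge number when post-processing a tree file. The helper below is a hypothetical convenience function, not part of placeviz, and assumes the label consists of exactly two non-negative integers joined by `_at_`.

```python
import re

# Assumed label shape: <count>_at_<edge number>, e.g. "15_at_4".
_LABEL = re.compile(r"(\d+)_at_(\d+)")


def parse_placement_label(label):
    """Return (count, edge_number), or None if the label has another shape."""
    match = _LABEL.fullmatch(label)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_placement_label("15_at_4"))
print(parse_placement_label("Prochlorococcus"))
```

Returning `None` for anything else lets ordinary taxon names pass through untouched.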
Figure 6
Simulation with 631 COG alignments. Error analysis from a simulation study using 631 COG alignments. Ten reads were simulated from each taxon of each alignment, and then binned according to the likelihood weight ratio of their best placement; ranges for the four bins are indicated in the legend. There is one scatter point in the plot for each bin of each alignment: the x-axis for each plot shows the number of taxa in the tree used for the simulation, and the y-axis shows the average error for that bin. For example, a point at (100, 1.2) labeled 0.5 - 0.75 indicates that the set of all placements for an alignment of 100 taxa with confidence score between 0.5 and 0.75 has average error of 1.2. As described in the text, the error metric is the number of internal nodes between the correct edge and the node placement edge.
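The binning step behind each scatter point can be sketched as follows. The caption names one bin (0.5 - 0.75) and states that there are four, so the sketch assumes four equal-width bins on [0, 1]; the actual legend ranges may differ. Function and variable names are hypothetical.

```python
from collections import defaultdict


def average_error_by_bin(placements, n_bins=4):
    """Average error per likelihood-weight-ratio bin.

    placements -- iterable of (lwr, error) pairs for one alignment
    n_bins     -- number of equal-width bins on [0, 1] (assumed layout)
    Returns {(low, high): mean_error} for the bins that received data.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for lwr, error in placements:
        if not 0.0 <= lwr <= 1.0:
            raise ValueError("likelihood weight ratio must lie in [0, 1]")
        # An LWR of exactly 1.0 falls into the top bin.
        idx = min(int(lwr * n_bins), n_bins - 1)
        totals[idx] += error
        counts[idx] += 1
    return {(idx / n_bins, (idx + 1) / n_bins): totals[idx] / counts[idx]
            for idx in sorted(counts)}


reads = [(0.9, 0), (0.8, 1), (0.6, 2), (0.55, 1), (0.3, 4)]
for (low, high), mean_error in average_error_by_bin(reads).items():
    print(f"{low:.2f} - {high:.2f}: {mean_error}")
```

Each key-value pair corresponds to one scatter point for one alignment: the bin supplies the label and the mean error supplies the y-coordinate.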
Figure 7
Accuracy versus distance to sister taxon: COG simulation. The relationship between accuracy and phylogenetic (sum of branch length) distance to the sister taxon for the COG simulation. For each taxon in each alignment, the phylogenetic distance to the closest sister taxon was calculated, along with the average placement error for the ten reads simulated from that taxon in that alignment. The results were binned and shown in boxplot form, with the central line showing the median, the box showing the interquartile range, and the "whiskers" showing the extent of values which are within 1.5 times the interquartile range beyond the lower and upper quartiles. Outliers eliminated for clarity.
Figure 8
Speed comparison of pplacer and RAxML's EPA algorithm. Time to place 10,000 16S rRNA reads of median length 198 nt onto a reference phylogenetic tree, with a 1287 nt reference alignment. "Γ model" refers to a four-category gamma model of rate heterogeneity [17], and "CAT" is an approximation which chooses a single rate for each site [76]. Tests run on an Intel Xeon @ 2.33 GHz.
Figure 9
Top placement accuracy comparison of pplacer and RAxML's EPA algorithm. Accuracy comparison between EPA and pplacer both run with the Γ model of rate variation, using reads of mean length 200 simulated from the test data sets from [28]. The x-axis numbers are the size of the data set used for simulation. The y-axis shows the error for the placement with the highest likelihood score.
Figure 10
Expected accuracy comparison of pplacer and RAxML's EPA algorithm. Comparison as in Figure 9 but scoring the expected error, i.e. the total error weighted by the likelihood weight ratios.
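The two scoring rules used in Figures 9 and 10 can be contrasted in a few lines. The sketch assumes the usual definition of the likelihood weight ratio (each candidate placement's likelihood divided by the sum over all candidates for that read) and takes per-placement log-likelihoods and error values as given inputs. All function names are hypothetical and are not drawn from pplacer or RAxML.

```python
import math


def likelihood_weight_ratios(log_likelihoods):
    """Normalize candidate-placement likelihoods so they sum to one."""
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_likelihoods)
    weights = [math.exp(ll - peak) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def top_error(lwrs, errors):
    """Error of the single highest-weight placement (Figure 9 style)."""
    best = max(range(len(lwrs)), key=lwrs.__getitem__)
    return errors[best]


def expected_error(lwrs, errors):
    """Error averaged over all placements, weighted by LWR (Figure 10 style)."""
    return sum(w * e for w, e in zip(lwrs, errors))


# Two equally likely candidate edges: one correct, one two nodes away.
lwrs = likelihood_weight_ratios([-1234.5, -1234.5])
print(lwrs, top_error(lwrs, [0, 2]), expected_error(lwrs, [0, 2]))
```

The expected error penalizes a method for spreading weight onto distant edges even when its top placement is correct, which is why the two figures can rank methods differently.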

References

    1. Margulies M, Egholm M, Altman W, Attiya S, Bader J, Bemben L, Berka J, Braverman M, Chen Y, Chen Z, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380.
    2. Culley A, Lang A, Suttle C. Metagenomic analysis of coastal RNA virus communities. Science. 2006;312(5781):1795–1798. doi: 10.1126/science.1127404.
    3. Gill S, Pop M, DeBoy R, Eckburg P, Turnbaugh P, Samuel B, Gordon J, Relman D, Fraser-Liggett C, Nelson K. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–1359. doi: 10.1126/science.1124234.
    4. Venter J, Remington K, Heidelberg J, Halpern A, Rusch D, Eisen J, Wu D, Paulsen I, Nelson K, Nelson W, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304(5667):66–74. doi: 10.1126/science.1093857.
    5. Tringe S, Rubin E. Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet. 2005;6(11):805–814. doi: 10.1038/nrg1709.