BMC Bioinformatics. 2010 Oct 30:11:538.

doi: 10.1186/1471-2105-11-538.

pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree

Frederick A Matsen¹, Robin B Kodner, E Virginia Armbrust

PMID: 21034504
PMCID: PMC3098090
DOI: 10.1186/1471-2105-11-538

Frederick A Matsen et al. BMC Bioinformatics. 2010.

Affiliation

¹ Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA. matsen@fhcrc.org

Abstract

Background: Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power offered by likelihood-based approaches to large data sets.

Results: This paper introduces pplacer, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. Pplacer features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence.

Conclusions: Pplacer enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is freely available as source code, binaries, and a web service.

Figures

**Figure 1**
**Example application, showing uncertainty**. Pplacer example application using *psbA* reference sequences and the corresponding recruited Global Ocean Sampling [4] (GOS) sequences showing both number of placements and their uncertainty. Branch thickness is a linear function of the log-transformed number of placements on that edge, and branch color represents average uncertainty (more red implies more uncertain, with yellow denoting EDPL above a user-defined limit). The upper panel shows the *Prochlorococcus* clade of the tree. The lower panel shows a portion of the tree with substantial uncertainty using the EDPL metric. Placeviz output viewed using Archaeopteryx [32].

**Figure 2**
**Linear time dependence on number of reference taxa**. Time to place 10,000 16S rRNA reads of median length 198 nt onto a reference phylogenetic tree, with a 1287 nt reference alignment. Tests run on an Intel Xeon @ 2.33 GHz.

**Figure 3**
**pplacer memory requirements**. Memory required to place 10,000 16S rRNA reads of median length 198 nt onto a reference phylogenetic tree, with a 1287 nt reference alignment. Tests run on an Intel Xeon @ 2.33 GHz.

**Figure 4**
**Measuring uncertainty by the expected distance between placement locations (EDPL)**. The Expected Distance between Placement Locations (EDPL) uncertainty metric can indicate if placement uncertainty may pose a problem for downstream analysis. The EDPL uncertainty is the sum of the distances between the optimal placements weighted by their probability (4). The hollow stars on the left side of the tree depict a case where there is considerable uncertainty as to the exact placement edge, but the collection of possible edges all sit in a small region of the tree. This local uncertainty would have a low EDPL score. The full stars on the right side of the diagram would have a large EDPL, as the different placements are spread widely across the tree. Such a situation can be flagged for special treatment or removal.
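
The EDPL described in this caption can be illustrated with a minimal Python sketch. It assumes the pairwise form, in which each pair of candidate placements contributes the product of their probabilities times the tree distance between them. The function name `edpl`, the precomputed distance matrix, and the example numbers are hypothetical and are not taken from pplacer itself.

```python
def edpl(weights, dist):
    """Expected distance between placement locations (illustrative sketch).

    weights: non-negative placement weights for one query sequence
             (normalized here so they sum to 1).
    dist:    square matrix; dist[i][j] is the tree distance between
             placement locations i and j (assumed precomputed).
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    n = len(probs)
    # Sum of pairwise distances, each weighted by the probability of both placements.
    return sum(probs[i] * probs[j] * dist[i][j]
               for i in range(n) for j in range(n))


# Local uncertainty: two candidate edges that sit close together on the tree.
local = edpl([0.5, 0.5], [[0.0, 0.02], [0.02, 0.0]])

# Global uncertainty: the same weights, but the edges are far apart.
spread = edpl([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])

print(local < spread)
```

Both queries are equally uncertain about which edge is correct, but only the second has a large EDPL, which is the case the caption suggests flagging.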

**Figure 5**
**Example application**. Placement visualization of same results as in Figure 1. The notation "15_at_4", for example, means that 15 sequences were placed at internal edge number 4. These edge numbers can then be used to find the corresponding sequences in the .loc.fasta file. Placeviz output viewed using FigTree [75].

**Figure 6**
**Simulation with 631 COG alignments**. Error analysis from a simulation study using 631 COG alignments. Ten reads were simulated from each taxon of each alignment, and then binned according to the likelihood weight ratio of their best placement; ranges for the four bins are indicated in the legend. There is one scatter point in the plot for each bin of each alignment: the x-axis for each plot shows the number of taxa in the tree used for the simulation, and the y-axis shows the average error for that bin. For example, a point at (100, 1.2) labeled 0.5 - 0.75 indicates that the set of all placements for an alignment of 100 taxa with confidence score between 0.5 and 0.75 has average error of 1.2. As described in the text, the error metric is the number of internal nodes between the correct edge and the node placement edge.
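
The binning step in this caption can be sketched as follows. The likelihood weight ratio of a placement is taken here to be its likelihood divided by the sum of the likelihoods of all candidate placements for the same read. The bin edges (0.25, 0.5, 0.75) are an assumption chosen to be consistent with the "0.5 - 0.75" example; the actual ranges are given only in the figure legend. The function names are hypothetical.

```python
import bisect
import math


def likelihood_weight_ratios(log_likelihoods):
    """Convert per-edge log-likelihoods for one read into weights that sum to 1."""
    # Subtract the maximum before exponentiating to avoid floating-point underflow.
    m = max(log_likelihoods)
    weights = [math.exp(ll - m) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def confidence_bin(best_lwr, edges=(0.25, 0.5, 0.75)):
    """Return the index (0 to 3) of the bin containing the best placement's ratio.

    The edges are assumed, not taken from the paper.
    """
    return bisect.bisect_right(edges, best_lwr)


# Hypothetical log-likelihoods for three candidate edges of one read.
lwrs = likelihood_weight_ratios([-1000.0, -1001.2, -1003.5])
print(confidence_bin(max(lwrs)))
```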

**Figure 7**
**Accuracy versus distance to sister taxon: COG simulation**. The relationship between accuracy and phylogenetic (sum of branch length) distance to the sister taxon for the COG simulation. For each taxon in each alignment, the phylogenetic distance to the closest sister taxon was calculated, along with the average placement error for the ten reads simulated from that taxon in that alignment. The results were binned and shown in boxplot form, with the central line showing the median, the box showing the interquartile range, and the "whiskers" showing the extent of values which are within 1.5 times the interquartile range beyond the lower and upper quartiles. Outliers eliminated for clarity.

**Figure 8**
**Speed comparison of pplacer and RAxML's EPA algorithm**. Time to place 10,000 16S rRNA reads of median length 198 nt onto a reference phylogenetic tree, with a 1287 nt reference alignment. "Γ model" refers to a four-category gamma model of rate heterogeneity [17], and "CAT" is an approximation which chooses a single rate for each site [76]. Tests run on an Intel Xeon @ 2.33 GHz.

**Figure 9**
**Top placement accuracy comparison of pplacer and RAxML's EPA algorithm**. Accuracy comparison between EPA and pplacer both run with the Γ model of rate variation, using reads of mean length 200 simulated from the test data sets from [28]. The x-axis numbers are the size of the data set used for simulation. The y-axis shows the error for the placement with the highest likelihood score.

**Figure 10**
**Expected accuracy comparison of pplacer and RAxML's EPA algorithm**. Comparison as in Figure 9 but scoring the expected error, i.e. the total error weighted by the likelihood weight ratios.
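
The difference between the two scores in Figures 9 and 10 can be shown with a small sketch. It assumes each read has a list of candidate placements, each with a likelihood weight ratio and an error value (for instance the node-count error metric of Figure 6). The function name and the numbers are hypothetical.

```python
def top_and_expected_error(lwrs, errors):
    """Return (error of the highest-weight placement, weight-averaged error).

    lwrs:   likelihood weight ratios for one read's candidate placements
            (assumed to sum to 1).
    errors: error of each candidate placement, in the same order.
    """
    best = max(range(len(lwrs)), key=lambda i: lwrs[i])
    top_error = errors[best]
    expected_error = sum(w * e for w, e in zip(lwrs, errors))
    return top_error, expected_error


# The best placement is exactly right, but 40% of the weight sits on wrong edges.
top, expected = top_and_expected_error([0.6, 0.3, 0.1], [0, 2, 5])
print(top, round(expected, 3))
```

In this example the top placement error is zero while the expected error is not, so the second score also penalizes weight assigned to incorrect edges.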

References

1. Margulies M, Egholm M, Altman W, Attiya S, Bader J, Bemben L, Berka J, Braverman M, Chen Y, Chen Z, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380.
2. Culley A, Lang A, Suttle C. Metagenomic analysis of coastal RNA virus communities. Science. 2006;312(5781):1795–1798. doi: 10.1126/science.1127404.
3. Gill S, Pop M, DeBoy R, Eckburg P, Turnbaugh P, Samuel B, Gordon J, Relman D, Fraser-Liggett C, Nelson K. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–1359. doi: 10.1126/science.1124234.
4. Venter J, Remington K, Heidelberg J, Halpern A, Rusch D, Eisen J, Wu D, Paulsen I, Nelson K, Nelson W, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304(5667):66–74. doi: 10.1126/science.1093857.
5. Tringe S, Rubin E. Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet. 2005;6(11):805–814. doi: 10.1038/nrg1709.
