Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 May;60(3):291-302.
doi: 10.1093/sysbio/syr010. Epub 2011 Mar 23.

Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood

Affiliations

Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood

Simon A Berger et al. Syst Biol. 2011 May.

Abstract

We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pair-wise sequence comparison, using edit distances and BLAST. We introduce a slow and accurate as well as a fast and less accurate placement algorithm. For the slow algorithm, we develop additional heuristic techniques that yield almost the same run times as the fast version with only a small loss of accuracy. When those additional heuristics are employed, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a high number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA.

PubMed Disclaimer

Figures

F<sc>IGURE</sc> 1
FIGURE 1
Local optimization of edge lengths for the insertion of a QS into the RT.
F<sc>IGURE</sc> 2
FIGURE 2
Evolutionary identification of three QSs (QS0, QS1, QS2) using a 4-taxon RT.
F<sc>IGURE</sc> 3
FIGURE 3
Phylogenetic placement of three QSs (QS0, QS1, QS2) onto a 4-taxon RT with insertion support score.
F<sc>IGURE</sc> 4
FIGURE 4
Illustration of the criterion for the QS selection and experimental setup. (a) Candidate QS belongs to subtree of size 2 that is connected to the tree by a well-supported edge. It has one other tip as direct neighbor. QS with this property will be referred to as outer QS. (b) Candidate QS is connected to the tree by two well-supported edges. QS with this property will be referred to as inner QS. (c) Experimental setting: reinsert shortened candidate QS in to pruned RT. We use three different ways of generating QS with desirable features: contiguous subsequences, paired-end reads, and random gaps.
F<sc>IGURE</sc> 5
FIGURE 5
Illustration of the tree-based distance measures. a) Example tree with two edges (original and insertion edge) highlighted. There are two nodes on the path, so the ND is 2. The edge distance corresponds to the length of the connecting path, where of the two end edges only half of the edge length is used. b) Tree diameter that is used to normalize the edge distance.
F<sc>IGURE</sc> 6
FIGURE 6
Placement accuracy for QS with artificially introduced random gaps. a) Average ND and b) NED (between insertion positions and real positions).
F<sc>IGURE</sc> 7
FIGURE 7
Histogram showing the placement accuracy, based on NDs, for the placement of 2 × 100 bp paired-end reads from D855, using outer a) and inner b) QS.
F<sc>IGURE</sc> 8
FIGURE 8
Average ND for different versions of the EPA (fast/slow insertions) algorithm and model types (GTR + Γ, GTR + CAT) on inner QS and outer QS from all data sets.
F<sc>IGURE</sc> 9
FIGURE 9
Accuracy of the EPA as a function of the number of edges considered for slow insertion after heuristic filtering.

Similar articles

Cited by

References

    1. Ababneh F, Jermiin LS, Ma C, Robinson J. Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics. 2006;22:1225–1231. - PubMed
    1. Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389. - PMC - PubMed
    1. Berger SA, Stamatakis A. Accuracy of morphology-based phylogenetic fossil placement under maximum likelihood. Proceedings of IEEE/ACS International Conference on Computer Systems and Applications (AICCSA-10); 2010 May 16–18; Hammamet, Tunisia: IEEE Computer Society. p. 1–8. 2010
    1. Bininda-Emonds ORP, Brady SG, Sanderson MJ, Kim J. Scaling of accuracy in extremely large phylogenetic trees. In: Altman RB, Dunker AK, Hunter L, Lauderdale K, Klein TE, editors. Pacific Symposium on Biocomputing 2001. River Edge (NJ): World Scientific; 2000. pp. 547–558. - PubMed
    1. Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat. Methods. 2009;6:673–676. - PMC - PubMed

Publication types