J Comput Biol. 2015 May;22(5):377-86.
doi: 10.1089/cmb.2014.0156. Epub 2014 Dec 30.

PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences


Siavash Mirarab et al. J Comput Biol. 2015 May.

Abstract

We introduce PASTA, a new multiple sequence alignment algorithm. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy and scalability of the leading alignment methods (including SATé). We also show that trees estimated on PASTA alignments are highly accurate, slightly better than SATé trees and substantially better than trees from other methods. Finally, PASTA is faster than SATé, highly parallelizable, and requires relatively little memory.

Keywords: algorithms; metagenomics; molecular evolution; multiple alignment; phylogenetic trees.


Figures

FIG. 1.
Algorithmic design of PASTA. The first six boxes show the steps involved in one iteration of PASTA. The last two boxes show the meaning of transitivity for homologies defined by a column of an MSA, and how the concept of transitivity can be used to merge two compatible and overlapping alignments. MSA, multiple sequence alignment.
FIG. 2.
Tree error rates on nucleotide datasets. We show missing branch (also known as false negative or FN) rates for maximum likelihood trees estimated on the reference alignment as well as alignments computed using PASTA and other methods; results not shown indicate failure to complete within 24 hr using 12 cores on the datasets. Error bars show standard error over 10 replicates for all model conditions of the Indelible and the 10,000-sequence RNASim datasets.
FIG. 3.
Alignment running time (hours). Note that PASTA was run for three iterations everywhere, except on the 100,000-sequence RNASim dataset, where it was run for two iterations, and on the 200,000-sequence RNASim dataset, where it was run for one iteration. MAFFT was run in default mode, except on the 100,000-sequence dataset, where PartTree was used.
FIG. 4.
Running time comparison of PASTA and SATé-II. (a) Running time profiling on one iteration for RNASim datasets with 10K and 50K sequences (the dotted region indicates the last pairwise merge); (b) running time for one iteration of PASTA with 12 CPUs as a function of the number of sequences (the solid line is fitted to the first two points); and (c) scalability for PASTA and SATé-II with increased number of CPUs.
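
The caption of FIG. 1 refers to using transitivity to merge two compatible, overlapping alignments. The Python sketch below illustrates that general idea only; it is not taken from the PASTA implementation. It assumes each alignment is a list of columns, and each column is a dict mapping a sequence name to a residue index (gaps are simply omitted). The function name `merge_by_transitivity` and this data layout are hypothetical choices made for the illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def merge_by_transitivity(aln_a, aln_b):
    """Merge two overlapping alignments (hypothetical helper, not PASTA code).

    Each alignment is a list of columns; each column maps
    sequence name -> residue index, with gapped sequences omitted.
    Residues sharing a column are homologous, and homology is transitive,
    so merged columns are the connected groups of residues.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Residues that share a column in either input go into the same group.
    for column in aln_a + aln_b:
        residues = list(column.items())
        for residue in residues:
            parent[find(residue)] = find(residues[0])

    # Collect each group into one merged column.
    merged = {}
    for seq, pos in list(parent):
        column = merged.setdefault(find((seq, pos)), {})
        if column.get(seq, pos) != pos:
            raise ValueError(f"incompatible alignments: {seq} occurs twice in a column")
        column[seq] = pos

    # Order columns so that residue order within every sequence is preserved.
    per_seq = {}
    for root, column in merged.items():
        for seq, pos in column.items():
            per_seq.setdefault(seq, []).append((pos, root))
    sorter = TopologicalSorter({root: set() for root in merged})
    for entries in per_seq.values():
        entries.sort()
        for (_, before), (_, after) in zip(entries, entries[1:]):
            sorter.add(after, before)
    # static_order raises graphlib.CycleError if no consistent order exists.
    return [merged[root] for root in sorter.static_order()]


# Alignment A covers s1 and s2; alignment B covers s2 and s3 (overlap: s2).
aln_a = [{"s1": 0, "s2": 0}, {"s1": 1, "s2": 1}]
aln_b = [{"s2": 0, "s3": 0}, {"s2": 1, "s3": 1}]
merged_columns = merge_by_transitivity(aln_a, aln_b)
```

In this example the shared sequence `s2` links the two inputs, so residue 0 of `s1` and residue 0 of `s3` end up in the same merged column even though no input alignment contained both. When columns are not ordered relative to each other by any sequence, the topological sort picks one valid order among several.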

