Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Mar 15;32(6):814-20.
doi: 10.1093/bioinformatics/btv592. Epub 2015 Nov 14.

Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments

Affiliations

Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments

Gearóid Fox et al. Bioinformatics. .

Abstract

Motivation: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data.

Results: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs containing many thousands of sequences in each test case and which is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins.

Availability and implementation: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz

Contact: des.higgins@ucd.ie

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Flowchart of benchmark process for one test case. Sequences from Pfam are aligned with the method of interest and the resulting MSA is used to predict residue–residue contacts for one of the proteins in the alignment. The 3D coordinates of this target protein are used to calculate the true residue–residue contacts. The two lists of contacts are compared to calculate a score for the alignment
Fig. 2.
Fig. 2.
Comparison of Kalign 2 and Clustal Omega guide tree imbalance. The Sackin score (sum of distances from leaves to root) produced by each program is plotted against the number of sequences in the alignment for each test case. Values for fully chained and balanced trees and expected values under the Equal Rates Markov and Proportional to Distinguishable Arrangements models of tree growth are indicated with lines
Fig. 3.
Fig. 3.
Introducing random misalignments decreases the benchmark score. Each boxplot represents 20 replicates where a different random subset of sequences is misaligned. There is a strong correlation between more errors and decreasing benchmark score

References

    1. Berman H.M., et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242. - PMC - PubMed
    1. Blackshields G., et al. (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol. Biol., 5, 21. - PMC - PubMed
    1. Boyce K., et al. (2014) Simple chained guide trees give high-quality protein multiple sequence alignments. Proc. Natl Acad. Sci. USA, 111, 10556–10561. - PMC - PubMed
    1. Boyce K., et al. (2015) Reply to Tan et al.: differences between real and simulated proteins in multiple sequence alignments: Fig. 1. Proc. Natl Acad. Sci. USA, 112, E101. - PMC - PubMed
    1. Carlson M., et al. (n.d.) PFAM.db: A Set of Protein ID Mappings for PFAM. R package version 3.1.2. http://bioconductor.org/packages/release/data/annotation/html/PFAM.db.html.