Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments

Gearóid Fox¹, Fabian Sievers¹, Desmond G Higgins¹

PMID: 26568625
PMCID: PMC5939968
DOI: 10.1093/bioinformatics/btv592

Bioinformatics. 2016 Mar 15;32(6):814-20. Epub 2015 Nov 14.

Affiliation

¹ Conway Institute of Biomolecular and Biomedical Research, and UCD School of Medicine and Medical Science, University College Dublin, Dublin 4, Ireland.

Abstract

Motivation: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data.

Results: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs that contains many thousands of sequences in each test case and is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins.

Availability and implementation: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz

Contact: des.higgins@ucd.ie

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Flowchart of benchmark process for one test case. Sequences from Pfam are aligned with the method of interest and the resulting MSA is used to predict residue–residue contacts for one of the proteins in the alignment. The 3D coordinates of this target protein are used to calculate the true residue–residue contacts. The two lists of contacts are compared to calculate a score for the alignment.
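
The comparison step in this flowchart can be illustrated with a minimal Python sketch. It is not the ContTest implementation: the 8 Å distance cutoff, the minimum sequence separation of 6, the choice of one representative atom per residue, and the use of top-N precision as the score are all assumptions made for illustration, and the function names `true_contacts` and `contact_precision` are hypothetical.

```python
import math

def true_contacts(coords, threshold=8.0, min_separation=6):
    """Return the set of residue pairs (i, j), i < j, that are in contact.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    threshold and min_separation are illustrative assumptions, not values
    taken from the paper.
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.add((i, j))
    return contacts

def contact_precision(predicted, contacts, top_n):
    """Fraction of the top_n highest-scoring predicted pairs that are true contacts.

    predicted: list of (i, j, score) tuples with i < j.
    """
    ranked = sorted(predicted, key=lambda p: p[2], reverse=True)[:top_n]
    if not ranked:
        return 0.0
    hits = sum(1 for i, j, _ in ranked if (i, j) in contacts)
    return hits / len(ranked)

# Toy target: ten residues spaced 1 Å apart along a straight line.
coords = [(float(k), 0.0, 0.0) for k in range(10)]
contacts = true_contacts(coords)
predicted = [(0, 6, 0.9), (0, 9, 0.8), (1, 8, 0.7)]
score = contact_precision(predicted, contacts, top_n=2)
```

Under these assumptions, a better alignment should yield predicted contacts that overlap more with the structure-derived contacts, and therefore a higher score.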

**Fig. 2.**
Comparison of Kalign 2 and Clustal Omega guide tree imbalance. The Sackin score (sum of distances from leaves to root) produced by each program is plotted against the number of sequences in the alignment for each test case. Values for fully chained and balanced trees and expected values under the Equal Rates Markov and Proportional to Distinguishable Arrangements models of tree growth are indicated with lines.
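
The Sackin score described in this caption can be sketched in a few lines of Python. The sketch assumes that distance is counted in edges (branch lengths are ignored) and that a guide tree is represented as nested 2-tuples; the helper names `sackin`, `chained_tree` and `balanced_tree` are hypothetical and are not taken from Kalign 2 or Clustal Omega.

```python
def sackin(tree, depth=0):
    """Sum of leaf depths (edges from the root) for a tree of nested tuples.

    Any non-tuple value is treated as a leaf. The recursion is fine for
    small examples; very deep chained trees would need an iterative version.
    """
    if not isinstance(tree, tuple):
        return depth
    return sum(sackin(child, depth + 1) for child in tree)

def chained_tree(n):
    """Fully chained (caterpillar) tree over leaves 0..n-1."""
    tree = 0
    for leaf in range(1, n):
        tree = (tree, leaf)
    return tree

def balanced_tree(leaves):
    """Tree built by recursively splitting the leaf list in half."""
    if len(leaves) == 1:
        return leaves[0]
    mid = len(leaves) // 2
    return (balanced_tree(leaves[:mid]), balanced_tree(leaves[mid:]))

n = 8
chained_score = sackin(chained_tree(n))                # (n - 1)(n + 2) / 2
balanced_score = sackin(balanced_tree(list(range(n)))) # n * log2(n) for n a power of two
```

For a fixed number of leaves, the chained tree gives the largest Sackin score and the balanced tree the smallest, so the two bound the values a guide-tree program can produce.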

**Fig. 3.**
Introducing random misalignments decreases the benchmark score. Each boxplot represents 20 replicates where a different random subset of sequences is misaligned. The number of introduced errors correlates strongly with the decrease in benchmark score.
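
One way to picture this control experiment is the Python sketch below. The caption does not say how the sequences were misaligned, so the perturbation used here (rotating each chosen row by one column) is purely an assumption, and the function `misalign` is hypothetical.

```python
import random

def misalign(msa, n_errors, seed=None):
    """Return (perturbed_msa, chosen_rows) with n_errors rows shifted by one column.

    msa: list of equal-length, non-empty gapped strings.
    The one-column rotation is an illustrative perturbation only; it keeps
    every row the same length so the result is still a rectangular alignment.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(msa)), n_errors))
    perturbed = [
        row[-1] + row[:-1] if i in chosen else row
        for i, row in enumerate(msa)
    ]
    return perturbed, chosen

msa = ["ACDE-", "AC-EF", "A-DEF", "ACDEF"]
perturbed, chosen = misalign(msa, n_errors=2, seed=0)
```

Repeating the call with different seeds gives replicates in which a different random subset of rows is perturbed, which mirrors the replicate design described in the caption.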
