Genome Res. 2014 Dec;24(12):2077-89.
doi: 10.1101/gr.174920.114. Epub 2014 Oct 1.

Alignathon: a competitive assessment of whole-genome alignment methods

Dent Earl et al. Genome Res. 2014 Dec.

Abstract

Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three data sets were used: two were simulated and based on primate and mammalian phylogenies, and one consisted of 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.

Figures

Figure 1.
The phylogenies of the three test sets: primate simulation, mammal simulation, and real fly data set. Branch lengths are in units of neutral substitutions per site.
Figure 2.
Simulated primate and mammal F-score results. Recall as a function of precision is shown for primates (A) and mammals (C). GenomeMatch-3 is omitted from plot A because both of its values are low (see its overall F-score in B). (B) The primate F-score results isolated to different annotation types: overall, genes, neutral, and repetitive regions. (D) The mammal version of B. Legends for B and D are ordered as in the overall category, and this order is maintained in the genes and neutral annotations.
Figure 3.
Primate (A) and mammal (B) simulation F-score results stratified by phylogenetic distance. For each subplot, the vertical axis shows the F-score and the horizontal axis shows 13 individual submissions ordered from left to right (descending) by average overall F-score. Horizontal gray lines show the overall F-score of the submission, taking into account all sequence pairs. Horizontal black lines show the overall F-score of the submission, taking into account only sequence pairs including the reference. Each submission is drawn as points connected by a line, with the points in ascending order of phylogenetic distance (all possible pairs are shown).
Figure 4.
Simulated mammal results comparing simulation values to statistical values. Shown are precision and PSAR; recall and coverage; F-score and pseudo F-score. Each column represents the results of one submission; columns are in descending order of overall (full genome) F-score value. The horizontal line is, respectively, the overall precision, recall, or F-score value; the upward triangle with a vertical line is the regional precision, recall, or F-score mean value, ± the regional standard deviation; the downward triangle with a vertical line is the PSAR-precision, coverage, or pseudo F-score mean value, ± standard deviation for values that were computed using regional subalignments.
Figure 5.
Fly results: values of PSAR-precision; average overall coverage between all pairs; pseudo F-score. Columns are in descending order of mean pseudo F-score value. For each metric, each submission is made up of a downward triangle with a vertical line representing the regional mean ± SD.
Figure 6.
Overall pairwise coverage values in the fly data set. Submissions are ordered left to right (descending) by overall coverage. Gray points are nonreference pairs, and black points contain the reference. The horizontal gray line shows the average coverage of the submission for all points, and the horizontal black line shows the average coverage of the submission just for pairs containing the reference. Beneath the pairwise coverage plot is a barcode plot showing the phylogenetic distances of all pairs. Shorter gray lines are nonreference pairs and longer black lines are reference-containing pairs.
Figure 7.
Region 2 of D. melanogaster (dm3) with respect to D. grimshawi (droGri2), from the regional analysis of the fly data set. Region 2 is defined as bases 12,450,223–12,950,222 of dm3 chromosome 3R (horizontal axis). Rows are as follows: the relative abundance of genes within the region; the relative abundance of repetitive sequence in the region; and submissions in descending order of average pseudo F-score. Each submission row shows the pseudo F-score of the submission in black. The vertical axis of each row uses the same scale as shown in the bottom row. The pseudo F-score value of the top submission for this region (TBA) is shown in gray in the background.
Figure 8.
The Jaccard distance (1 − Jaccard similarity coefficient) matrix and accompanying hierarchical clustering (UPGMA) of submissions for each of the three test sets. (A) Primate Jaccard distance; (B) mammal Jaccard distance; (C) fly Jaccard distance. Higher values indicate that the sets of aligned pairs of two submissions are more dissimilar, and lower values indicate similarity.
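
The metrics named in the captions above (precision, recall, F-score, and Jaccard distance) can all be read as operations on sets of aligned pairs. The following is a minimal sketch under stated assumptions, not the assessment code used in the study: it assumes each alignment is reduced to a set of unordered pairs of genome positions, that the F-score is the harmonic mean of precision and recall, and that the simulated "true" alignment is available as such a set. The names `aligned_pair`, `precision_recall_f`, and `jaccard_distance` are hypothetical and introduced only for illustration.

```python
from typing import FrozenSet, Set, Tuple

# A genome position is (sequence name, 0-based offset); an aligned pair is an
# unordered pair of positions, so a frozenset is used to ignore ordering.
Position = Tuple[str, int]
AlignedPair = FrozenSet[Position]


def aligned_pair(a: Position, b: Position) -> AlignedPair:
    """Build an order-independent aligned pair (hypothetical helper)."""
    return frozenset((a, b))


def precision_recall_f(predicted: Set[AlignedPair],
                       truth: Set[AlignedPair]) -> Tuple[float, float, float]:
    """Precision, recall, and F-score of predicted pairs against true pairs.

    The F-score here is assumed to be the harmonic mean of precision and recall.
    """
    shared = len(predicted & truth)
    precision = shared / len(predicted) if predicted else 0.0
    recall = shared / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score


def jaccard_distance(a: Set[AlignedPair], b: Set[AlignedPair]) -> float:
    """1 - |A intersect B| / |A union B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy data: four true pairs between two made-up sequences, and a submission
# that recovers two of them and adds one incorrect pair.
truth = {aligned_pair(("seqA", i), ("seqB", i)) for i in range(4)}
submission = {
    aligned_pair(("seqA", 0), ("seqB", 0)),
    aligned_pair(("seqA", 1), ("seqB", 1)),
    aligned_pair(("seqA", 2), ("seqB", 7)),  # not in the true alignment
}

precision, recall, f_score = precision_recall_f(submission, truth)
distance = jaccard_distance(submission, truth)
```

In this toy case two of the three submitted pairs are correct and two of the four true pairs are recovered. Applying `jaccard_distance` to every pair of submissions would give a distance matrix of the kind that Figure 8 clusters; the clustering step itself is not sketched here.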
