Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Feb 1;27(3):334-42.
doi: 10.1093/bioinformatics/btq665. Epub 2010 Dec 9.

Mugsy: fast multiple alignment of closely related whole genomes

Affiliations

Mugsy: fast multiple alignment of closely related whole genomes

Samuel V Angiuoli et al. Bioinformatics. .

Abstract

Motivation: The relative ease and low cost of current generation sequencing technologies has led to a dramatic increase in the number of sequenced genomes for species across the tree of life. This increasing volume of data requires tools that can quickly compare multiple whole-genome sequences, millions of base pairs in length, to aid in the study of populations, pan-genomes, and genome evolution.

Results: We present a new multiple alignment tool for whole genomes named Mugsy. Mugsy is computationally efficient and can align 31 Streptococcus pneumoniae genomes in less than 2 hours producing alignments that compare favorably to other tools. Mugsy is also the fastest program evaluated for the multiple alignment of assembled human chromosome sequences from four individuals. Mugsy does not require a reference sequence, can align mixtures of assembled draft and completed genome data, and is robust in identifying a rich complement of genetic variation including duplications, rearrangements, and large-scale gain and loss of sequence.

Availability: Mugsy is free, open-source software available from http://mugsy.sf.net.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
The process flow and primary steps of Mugsy. The key steps are listed in boxes and data types that are input and output at each step are shown adjacent to the arrows. Software used to implement parts of each step is listed on the left. The execution time of each step from an alignment of 4 human chromosomes is provided on the right. The component timings include parsing input and writing outputs. Tests were run on a single CPU of an Intel Xeon 5570 processor with 16 GB of RAM.
Fig. 2.
Fig. 2.
Generation of multi-genome anchors from connected components in the alignment graph. Three sequences are shown (S1, S2, S3) with matching segments from the alignment graph (top). Connected components define three multi-genome anchors (bottom). Adjacent anchors along a sequence are connected by edges and labeled with the sequence identifier. To handle inconsistencies in the alignment graph, connected components are built in a greedy fashion traversing the most consistent edges first and restricting anchors to one alignment segment per genome (data not shown). Multiple segments from the same genome are allowed only if they are within a configurable distance along the sequence.
Fig. 3.
Fig. 3.
Identification of LCBs in the anchor graph. A set of multi-genome anchors labeled A–G are shown. Anchors adjacent along one or more sequences are connected by an edge. (a) Simple paths with exactly one incoming and outgoing edge correspond to collinear regions and branches correspond to syntenic breakpoints (dotted edges) resulting in three collinear regions colored blue, orange, green. (b) Merging of adjacent regions. A short component (D, E) with a genomic extent less than a configurable parameter L is removed from the graph. The remaining anchors form a single collinear region colored blue. (c) Cutting of paths that violate LCBs constraints with max-flow, min-cut. Anchors B and E are adjacent but non-syntenic separated by a genomic extent greater than the configurable parameter G in at least one sequence. The graph forms a single connected component that is an invalid LCB. To resolve this, the anchor graph is interpreted as a flow network. Edges are labeled with an edge capacity indicating the number of sequences for which the incident anchors are collinear. Source and sink vertices (grey) are added to the graph incident to vertices that violate the distance criteria. Maximum flow, minimum cut identifies the cut (dotted edge B, C) to produce two collinear regions colored blue and green. Max-flow, min-cut ensures the graph is cut to produce collinear regions that fulfill the distance constraint G regardless of cycles or branches in the graph.
Fig. 4.
Fig. 4.
Percent identity plots from the Mugsy multiple alignment of human chromosome 1 sequences of four individuals. The alignments span 99.9% of the nucleotides on chromosome 1 of the NCBI reference hg19, excluding the centromere, which is shown as a gap in the middle of the figure. The plots were obtained from the alignment viewer, GMAJ, using hg19 as reference for the display (top coordinates). A percent identity plot is displayed in subsequent rows for each of the three other genomes SJK, YH, JCV. The percent identity in each row ranges from 50 to 100 from the bottom to top of each row.

References

    1. Ahn SM, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 2009;19:1622–1629. - PMC - PubMed
    1. Batzoglou S. The many faces of sequence alignment. Brief Bioinform. 2005;6:6–22. - PubMed
    1. Blanchette M, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715. - PMC - PubMed
    1. Bourque G, et al. Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes. Genome Res. 2004;14:507–516. - PMC - PubMed
    1. Bradley RK, et al. Fast statistical alignment. PLoS Comput Biol. 2009;5:e1000392. - PMC - PubMed

Publication types