BMC Bioinformatics. 2011 Jun 30;12:272.

doi: 10.1186/1471-2105-12-272.

Improving pan-genome annotation using whole genome multiple alignment

Samuel V Angiuoli¹, Julie C Dunning Hotopp, Steven L Salzberg, Hervé Tettelin

PMID: 21718539
PMCID: PMC3142524
DOI: 10.1186/1471-2105-12-272

Samuel V Angiuoli et al. BMC Bioinformatics. 2011.

Affiliation

¹ Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA. angiuoli@umiacs.umd.edu

Abstract

Background: Rapid annotation and comparison of genomes from multiple isolates (pan-genomes) are becoming commonplace due to advances in sequencing technology. Genome annotations can contain inconsistencies and errors that hinder comparative analysis even within a single species. Tools are needed to compare and improve annotation quality across sets of closely related genomes.

Results: We introduce a new tool, Mugsy-Annotator, that identifies orthologs and evaluates annotation quality in prokaryotic genomes using whole genome multiple alignment. Mugsy-Annotator identifies anomalies in annotated gene structures, including inconsistently located translation initiation sites and disrupted genes due to draft genome sequencing or pseudogenes. An evaluation of species pan-genomes using the tool indicates that such anomalies are common, especially at translation initiation sites. Mugsy-Annotator reports alternate annotations that improve consistency and are candidates for further review.

Conclusions: Whole genome multiple alignment can be used to efficiently identify orthologs and annotation problem areas in a bacterial pan-genome. Comparisons of annotated gene structures within a species may show more variation than is actually present in the genome, indicating errors in genome annotation. Our new tool Mugsy-Annotator assists re-annotation efforts by highlighting edits that improve annotation consistency.

Figures

**Figure 1**
**Identifying orthologs and comparing gene structures in a pan-genome using whole genome multiple alignments**. The input is provided as a set of genomic sequences (FASTA format) and gene annotations (GFF3 format). Whole genome multiple alignments (top left) are first calculated using Mugsy [16]. Mugsy-Annotator then builds groups of orthologous gene structures that are conserved in sequence and genomic context according to the alignment. The alignment also indicates the location of each predicted translation initiation start and stop across the genomes, allowing for identification of annotation anomalies or missing annotations.
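A minimal Python sketch of the idea in this caption, that an alignment places each annotated start on a shared coordinate so that disagreements become visible. It is not Mugsy-Annotator's implementation: the function names, the toy alignment rows, and the majority-vote rule are all assumptions made for illustration, and only forward-strand genes are considered.

```python
from collections import Counter


def start_column(aligned_row, start_offset):
    """Return the alignment column of the base at ungapped offset `start_offset`."""
    seen = -1
    for col, ch in enumerate(aligned_row):
        if ch != "-":
            seen += 1
            if seen == start_offset:
                return col
    raise ValueError("start_offset lies beyond the end of the aligned sequence")


def inconsistent_starts(rows, starts):
    """Return the most common start column and the genomes whose start differs.

    rows:   genome name -> aligned sequence (gaps as '-'), all of equal length
    starts: genome name -> annotated start as an offset into the ungapped sequence
    """
    columns = {g: start_column(rows[g], starts[g]) for g in rows}
    majority_col, _ = Counter(columns.values()).most_common(1)[0]
    outliers = sorted(g for g, col in columns.items() if col != majority_col)
    return majority_col, outliers


# Hypothetical aligned rows for one ortholog group (toy data, not from the paper).
rows = {
    "genomeA": "ATGCCCATGAAA",
    "genomeB": "ATGCC-ATGAAA",
    "genomeC": "ATG-CCATGAAA",
}
# genomeC is annotated at the downstream ATG, the others at the first ATG.
starts = {"genomeA": 0, "genomeB": 0, "genomeC": 5}

majority_col, outliers = inconsistent_starts(rows, starts)
print(majority_col, outliers)
```

A real pipeline would read the rows from the alignment output and the starts from the GFF3 annotations; both are hard-coded here to keep the sketch self-contained.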

**Figure 2**
**Annotation anomalies identified by Mugsy-Annotator**. Four classes of anomalies are shown (a-d). On the right, examples of aligned genes are drawn with the boxed region indicating the location of the anomaly. On the left, a multiple alignment is depicted across the highlighted region with sequence identity indicated by dots. In (c), a gap indicated by a dash introduces a shift in reading frame that results in use of a termination codon that is inconsistent with the annotations in the other genomes. Translation initiation sites are marked as "start" and termination codons are marked as "stop" with an arrow indicating the direction of translation.
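The frame-shift case in panel (c) can be illustrated with a short Python sketch. It assumes a simplified model in which any internal gap run in a gene's aligned row whose length is not a multiple of three changes the reading frame; the function name and the toy rows are hypothetical, and this is not presented as the tool's actual procedure.

```python
import re


def frameshifting_gaps(aligned_row):
    """Return (column, length) for each internal gap run whose length is not a multiple of 3.

    Leading and trailing gaps are ignored because they reflect where the
    aligned segment begins and ends rather than an indel inside the gene.
    """
    leading = len(aligned_row) - len(aligned_row.lstrip("-"))
    core = aligned_row.strip("-")
    return [
        (m.start() + leading, len(m.group()))
        for m in re.finditer(r"-+", core)
        if len(m.group()) % 3 != 0
    ]


# Toy rows (not from the paper): a 1-base gap shifts the frame, a 3-base gap does not.
shifted = frameshifting_gaps("ATGAAA-CCTGA")
in_frame = frameshifting_gaps("ATG---CCCTGA")
print(shifted, in_frame)
```

In a multiple alignment a gap in one row can equally reflect an insertion in the other genomes, so a flagged gap is a candidate for review rather than proof of a disrupted gene.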

**Figure 3**
**Comparison of genes reported in orthology groups from Mugsy-Annotator and OrthoMCL**. The intersection gives the number of genes placed in ortholog groups by both methods. The remainders for Mugsy-Annotator and OrthoMCL give the number of genes placed in ortholog groups by only one of the methods.
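The comparison described in this caption amounts to set arithmetic over gene identifiers. The Python sketch below shows one way to compute the three counts; the function name, the gene identifiers, and the group contents are invented for illustration and do not reflect the output format of either tool.

```python
def compare_grouped_genes(groups_a, groups_b):
    """Count genes placed in ortholog groups by both methods or by only one.

    groups_a, groups_b: iterables of groups, each group an iterable of gene IDs.
    """
    genes_a = {gene for group in groups_a for gene in group}
    genes_b = {gene for group in groups_b for gene in group}
    return {
        "both": len(genes_a & genes_b),
        "only_a": len(genes_a - genes_b),
        "only_b": len(genes_b - genes_a),
    }


# Hypothetical groups; identifiers are placeholders, not real locus tags.
mugsy_groups = [{"A_001", "B_001"}, {"A_002", "B_002"}]
orthomcl_groups = [{"A_001", "B_001", "C_001"}, {"A_003", "B_003"}]

counts = compare_grouped_genes(mugsy_groups, orthomcl_groups)
print(counts)
```

Note that this counts genes, not groups: two methods can agree that a gene belongs to some ortholog group while still assigning it different partners.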

**Figure 4**
**Distribution of the number of genomes in ortholog groups identified by Mugsy-Annotator for 20 *Nmen* genomes**. The number of genomes per orthology group is shown for all orthology groups (top), consistently annotated groups only (middle), and groups with annotation inconsistencies only (bottom).

**Figure 5**
**Consistency of annotated gene structures in several species pan-genomes as reported by Mugsy-Annotator**. Each row gives the fraction of aligned gene sets in each class of anomaly, along with the fraction of groups with no identified inconsistencies (blue). The number of genomes compared and their average MUMi distance [23] are also provided; the distance ranges from 0 (most similar) to 1 (least similar). The bottom three rows describe three versions of annotations from the case study of *Neisseria meningitidis* (*Nmen*). The last version (*Nmen* verC) demonstrates improvements in consistency using alternative annotations suggested by Mugsy-Annotator.

**Figure 6**
**Annotation anomalies caused by a single genome**. Each row provides a count of ortholog groups where the named genome is inconsistent with the remaining genomes in the group. In these cases, the annotated translation initiation site in the named genome in *Nmen* verB did not match any of the other annotated gene structures in the ortholog groups.

**Figure 7**
**Distance of alternative TIS from the annotated site**. Distance between the annotated translation initiation site and the most consistent translation initiation site reported by Mugsy-Annotator.
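One way to read the distance in this caption is as the offset, in nucleotides of the genome's own sequence, between the annotated start and the base aligned to the most consistent start column. The Python sketch below computes that under stated assumptions: the function names, the sign convention (negative means upstream on the forward strand), and the toy row are hypothetical, and reverse-strand genes are not handled.

```python
def offset_at_column(aligned_row, column):
    """Return the ungapped offset of the base at `column`, or None if it is a gap."""
    if aligned_row[column] == "-":
        return None
    # Every gap to the left of the column shifts the ungapped offset down by one.
    return column - aligned_row.count("-", 0, column)


def tis_distance(aligned_row, annotated_offset, consistent_column):
    """Signed distance in nucleotides from the annotated start to the alternative start.

    Returns None when this genome has a gap at the consistent start column,
    in which case no alternative start can be proposed from the alignment alone.
    """
    alternative = offset_at_column(aligned_row, consistent_column)
    if alternative is None:
        return None
    return alternative - annotated_offset


# Toy row (not from the paper): annotated at offset 5, while column 0 is the
# start column used by the other genomes in the group.
distance = tis_distance("ATG-CCATGAAA", annotated_offset=5, consistent_column=0)
print(distance)
```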

References

1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2011;39:D32–37. doi: 10.1093/nar/gkq1079.
2. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119. doi: 10.1186/1471-2105-11-119.
3. Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23:673–679. doi: 10.1093/bioinformatics/btm009.
4. Lukashin AV, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 1998;26:1107–1115. doi: 10.1093/nar/26.4.1107.
5. Nielsen P, Krogh A. Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics. 2005;21:4322–4329. doi: 10.1093/bioinformatics/bti701.
