BMC Bioinformatics. 2011 Dec 22;12:491.
doi: 10.1186/1471-2105-12-491.

MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects

Carson Holt et al. BMC Bioinformatics. 2011.

Abstract

Background: Second-generation sequencing technologies are precipitating major shifts with regard to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies.

Results: We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review.

Conclusions: MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.

Figures

Figure 1
MAKER2 vs. ab initio predictors on second-generation genomes. We compared the performance of the ab initio predictor SNAP to the annotation pipeline MAKER2 on two second-generation genomes: L. humile (Argentine ant) and S. mediterranea (flatworm). Pfam domain content was used as a means to evaluate the performance of these algorithms, under the assumption that a poorly annotated genome will be globally depleted for domains relative to well-annotated genomes. (A) The average Pfam domain contents for six well-annotated eukaryotic reference proteomes: H. sapiens, M. musculus, D. melanogaster, C. elegans, A. thaliana, and S. cerevisiae. These data provide an upper bound for the expected domain content of a newly sequenced genome. The region of the pie chart outlined in red indicates the percentage of genes containing a Pfam domain; these are further subdivided by GO molecular function. (B) The Pfam domain content of SNAP-produced ab initio predictions compared to MAKER2-SNAP gene annotations for the L. humile genome. (C) The Pfam domain content of SNAP ab initio gene predictions and MAKER2-SNAP annotations in the S. mediterranea genome.
Figure 2
Evaluating AED as a metric for annotation quality control. Annotation Edit Distance (AED) provides a measurement for how well an annotation agrees with overlapping aligned ESTs, mRNA-seq and protein homology data. AED values range from 0 to 1, with 0 denoting perfect agreement of the annotation to aligned evidence, and 1 denoting no evidence support for the annotation. We evaluated the use of AED as a quality control metric by comparing MAKER2-produced AED scores for release 30 (2003) of the M. musculus genome to the AEDs for release 37.1 (2007). These data show how AED can be used to quantify improvements to the annotations between each release. (A) The Pfam domain content of M. musculus release 30 for genes found in each quartile of the MAKER2 AED distribution. Note that genes with low AEDs are highly enriched for domains. (B) The fraction of M. musculus genes from release 30 maintained/removed from subsequent release 37.1 for each MAKER2 AED distribution quartile. These data show how AED mirrors the independent curation decisions made by the mouse research community between 2003 and 2007. (C) The cumulative AED distributions of M. musculus release 30 and 37.1 demonstrate how AED quantifies improvements made between releases. The subset of genes with NM prefixes assigned by RefSeq (which indicates the highest level of annotation quality) is plotted separately to show that these independently identified 'gold-standard' gene annotations tend to have lower AED values in comparison to the genome as a whole.
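The caption gives the range and meaning of AED but not a formula. The sketch below assumes the commonly described nucleotide-level definition, AED = 1 - (sensitivity + specificity) / 2, where sensitivity is the fraction of evidence bases covered by the annotation and specificity is the fraction of annotation bases covered by evidence. The function names and the inclusive (start, end) exon representation are illustrative choices, not MAKER2's actual code.

```python
def exon_positions(exons):
    """Expand inclusive (start, end) intervals into a set of base positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(annotation_exons, evidence_exons):
    """Nucleotide-level AED under the assumed definition 1 - (SN + SP) / 2.

    Returns 0.0 for perfect agreement and 1.0 when the annotation has no
    overlapping evidence, matching the range described in the caption.
    """
    annotation = exon_positions(annotation_exons)
    evidence = exon_positions(evidence_exons)
    if not annotation or not evidence:
        return 1.0  # no evidence support (or an empty annotation)
    shared = len(annotation & evidence)
    sensitivity = shared / len(evidence)    # evidence bases recovered
    specificity = shared / len(annotation)  # annotation bases supported
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    # Annotation spans 100 bases; evidence supports only the second half.
    print(annotation_edit_distance([(1, 100)], [(51, 100)]))
```

Expanding intervals into position sets keeps the sketch short; for whole genomes an interval-arithmetic version would use far less memory.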
Figure 3
Re-annotation of a portion of the Maize genome using MAKER2. Annotation Edit Distance (AED) provides a measurement for how well an annotation agrees with its associated evidence (see text and Figure 1 for additional details). Shown are cumulative AED distributions for several Maize annotation datasets. Gold curve: AED distribution of high-quality 'gold standard' annotations in the benchmark region that are members of the J. Schnable and M. Freeling Classical Maize Genes List; these genes generally have the lowest AEDs. Red curve: all Maize gene models from the http://www.MaizeSequence.org 5a.59 Working Gene Set in the benchmark region. Blue curve: MAKER2's first pass, de novo annotations for the benchmark region; note that these genes generally have lower AEDs than the 5a.59 Working Gene Set (red curve). Purple curve: automatic MAKER2-based update/revision of the Maize 4a.53 Working Gene Set annotations. Note that the revised dataset now exceeds the quality of the 5a.59 Working Gene Set as judged by AED.
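A cumulative AED distribution of the kind plotted in this figure can be computed from a list of per-gene AED scores. The sketch below is a minimal illustration; the function name and the example scores are hypothetical and are not taken from the Maize datasets.

```python
def cumulative_aed_fraction(aed_scores, thresholds):
    """Fraction of annotations with AED <= each threshold.

    A curve that rises faster (more genes at low AED) indicates a dataset
    that agrees better with its evidence.
    """
    if not aed_scores:
        raise ValueError("aed_scores must not be empty")
    total = len(aed_scores)
    return [sum(1 for score in aed_scores if score <= t) / total
            for t in thresholds]


if __name__ == "__main__":
    # Hypothetical scores for two annotation sets over the same region.
    original = [0.10, 0.40, 0.60, 1.00]
    revised = [0.05, 0.10, 0.30, 0.60]
    thresholds = [0.25, 0.50, 1.00]
    print(cumulative_aed_fraction(original, thresholds))
    print(cumulative_aed_fraction(revised, thresholds))
```

Comparing the two output lists threshold by threshold is the numeric equivalent of checking which curve lies above the other.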
Figure 4
MAKER2 as a management tool for existing genome annotations. MAKER2 was used to add cross-species homology evidence and AED values to six published ant species. These data show how MAKER2 can be used both to add new data to existing datasets and for downstream prioritization of genes in those datasets for further analysis and curation. (A) The Pfam domain content in each AED quartile. Genes receiving higher AED scores are less likely to contain a domain, thus prioritizing them as possible false positive gene predictions. (B) The percent of genes in each AED quartile having an orthologous protein in a related ant species with the average number of orthologs per gene (for the subset of orthologous genes) listed at the bottom. AED score is highly correlated with orthology. (C) The cumulative AED distribution for all six ant species. The spike of genes with AED score at or near 1 suggests potential false positive gene predictions rather than species-specific genes, as these annotations also generally lack EST support and Pfam domains; these gene models are first in MAKER2's list for manual review.
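The prioritization described in this caption can be illustrated as a simple ranking: gene models with the highest AED come first, and among equal scores those without a Pfam domain are listed ahead of those with one. This is only a sketch of the idea; the `GeneModel` record, its fields, and the 0.75 cutoff are hypothetical and do not reflect MAKER2's internal data structures or thresholds.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    gene_id: str           # hypothetical identifier
    aed: float             # Annotation Edit Distance, 0 (best) to 1 (worst)
    has_pfam_domain: bool  # whether any Pfam domain was detected


def review_queue(genes, aed_cutoff=0.75):
    """Return gene models at or above the cutoff, worst-supported first.

    Ties on AED are broken by listing domain-less models first, then by ID
    so the order is deterministic.
    """
    flagged = [g for g in genes if g.aed >= aed_cutoff]
    return sorted(flagged, key=lambda g: (-g.aed, g.has_pfam_domain, g.gene_id))


if __name__ == "__main__":
    genes = [
        GeneModel("gene_a", 1.0, False),
        GeneModel("gene_b", 0.2, True),
        GeneModel("gene_c", 1.0, True),
        GeneModel("gene_d", 0.8, False),
    ]
    for gene in review_queue(genes):
        print(gene.gene_id, gene.aed, gene.has_pfam_domain)
```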
Figure 5
MAKER2 scales to even the largest genomes. MAKER2 was used to annotate a 10 megabase section of the C. elegans genome (NGASP dataset). The algorithm was parallelized using MPI on an increasing number of CPU cores. The results demonstrate how MAKER2 scales almost linearly with CPU number (with a slope of near 1). If we project our results forward to the entire C. elegans genome (~100 megabases), MAKER2 should take under 10 hours on 32 CPUs to complete; similarly, the human genome (~3 gigabases) would require fewer than 24 hours on 400 CPUs.
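The two projections in this caption can be checked with simple arithmetic. Assuming perfectly linear scaling (the caption reports only near-linear scaling), both imply the same minimum throughput per CPU. The function names below are illustrative, and because the caption states upper bounds ("under 10 hours", "fewer than 24 hours"), the computed rate is a lower bound rather than a measured value.

```python
def implied_throughput(genome_mb, hours, cpus):
    """Megabases annotated per CPU-hour, assuming perfectly linear scaling."""
    return genome_mb / (hours * cpus)


def projected_hours(genome_mb, cpus, mb_per_cpu_hour):
    """Wall-clock hours to annotate a genome at a given per-CPU rate."""
    return genome_mb / (cpus * mb_per_cpu_hour)


if __name__ == "__main__":
    # Figures quoted in the caption: ~100 Mb on 32 CPUs in under 10 hours,
    # ~3 Gb (3000 Mb) on 400 CPUs in fewer than 24 hours.
    worm_rate = implied_throughput(100, 10, 32)
    human_rate = implied_throughput(3000, 24, 400)
    print(worm_rate, human_rate)
    print(projected_hours(3000, 400, worm_rate))
```

Both quoted projections work out to the same rate of 0.3125 Mb per CPU-hour, so the human estimate is a direct linear extrapolation of the C. elegans one.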
