2011 Oct 15;5(1):168-93.
doi: 10.4056/sigs.2084864. Epub 2011 Oct 1.

Solving the Problem: Genome Annotation Standards before the Data Deluge

William Klimke et al. Stand Genomic Sci.

Abstract

The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI, in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.

Figures

Figure 1
Selected comparisons of genome measures. Principal component analysis showed expected relationships among the different measures (data not shown). Selected examples are plotted as double y-axis scatterplots. Legends indicate first or second y-axis for blue dots or red crosses, respectively. Linear regression analysis of each y-axis variable independently with respect to the x-axis variable was done, and the trend line is drawn on each plot, color-coded with respect to each measure. R² and p-values are shown for each measure.
A-B. Numbers of annotated proteins and RNAs with respect to genome size from INSDC and RefSeq annotation sets for complete prokaryotic genomes. Feature counts were obtained from the Complete Microbial Genomes Annotation Report (Aug 10, 2010), and proteins and RNAs from INSDC and RefSeq are plotted with respect to genome length. The count of proteins increases linearly with genome size (blue trend line), while the RNA count, which includes all transfer, ribosomal, and non-coding RNAs, shows less of an increase with genome size. Some genomes have extensively annotated RNA features, whereas others do not.
A. All INSDC genomes (total of 1218 as of Aug 10, 2010). Records that fall below the minimal standards for essential RNAs are encircled (red ellipse).
B. RefSeq genomes (total of 1148 genomes as of Aug 10, 2010). Note that not all INSDC genomes are copied into RefSeq records. For the cases where INSDC records were missing essential RNAs, if there was a RefSeq version, the essential RNAs have been added or properly labeled. In all cases where the full set of essential RNAs could not be annotated, it appeared that the missing RNA(s) were either non-functional or completely missing from the genome sequence (Table 3; data not shown).
C. Protein lengths with respect to coding density for INSDC annotations. As coding density increases (more proteins per Kbp), the average protein length decreases (blue trend line) and the ratio of short proteins increases (red trend line).
D. Hypothetical proteins and start codon ratios versus coding density. The ratio of proteins named 'hypothetical' increases slightly as the coding density increases, whereas the standard start codon ratio decreases. Genomes where the 'hypothetical protein' ratio is 1 or near 1 (large blue ellipse; every protein in the genome is annotated as 'hypothetical protein') fall below the minimal annotation standards. For these particular cases, if a RefSeq version of the annotation existed, the functional assignment of a number of proteins was improved via curated clusters in the NCBI ProtClustDB (data not shown).
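The Figure 1 legend describes fitting each y-axis measure independently against the shared x-axis variable and reporting R² for each fit. A minimal sketch of that idea in plain Python follows. It assumes ordinary least squares; the function name `fit_line` and all numbers are hypothetical illustrations, not values from the annotation report. The p-values shown in the figure are left out here because they require a t-distribution, which the standard library does not provide.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares fit of ys on xs.

    Returns (slope, intercept, r_squared). This is a generic textbook
    formulation, assumed here for illustration only.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 is undefined when y is constant; report NaN in that case.
    r_squared = float("nan") if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up toy values: genome length (Mbp) as the shared x-axis,
# with two measures plotted against it, one per y-axis.
genome_mbp = [1.0, 2.0, 4.0, 6.0]
protein_count = [950, 1900, 3700, 5600]
rna_count = [40, 45, 60, 70]

# Each measure is fitted independently against the same x variable.
for label, ys in (("proteins", protein_count), ("RNAs", rna_count)):
    slope, intercept, r2 = fit_line(genome_mbp, ys)
    print(f"{label}: slope={slope:.1f} per Mbp, intercept={intercept:.1f}, R^2={r2:.3f}")
```

Fitting the two measures separately, rather than jointly, matches the legend's description of one trend line per measure on a double y-axis plot.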
Figure 2
Heatmap of selected annotation report measures for gammaproteobacteria. A set of measures was chosen corresponding to those used in the principal component analysis (data not shown), but restricted to INSDC genomes from gammaproteobacteria. A two-dimensional clustering of the selected and scaled data (column means subtracted, then divided by the standard deviation) yields clusters similar to those obtained in the PCA (data not shown). For Figure 2, no clustering was done; the input genomes are arranged alphabetically by organism name and shaded to indicate different genera. A color key and histogram at bottom right indicate the relative intensities of the annotation measures (the histogram applies to all measures, color intensities apply to each cell). Genomes described in the text are in bold.
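The scaling named in the Figure 2 legend (subtract each column's mean, then divide by its standard deviation) can be sketched as below. This is an illustration under stated assumptions: the legend does not say whether the sample or the population standard deviation was used, so the sample form is assumed here, and the function name `scale_columns` and the toy matrix are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def scale_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Z-score each column of a genomes-by-measures table.

    Each row is one genome, each column one annotation measure.
    Assumption: the sample standard deviation (n - 1 denominator) is used.
    Columns with zero spread are mapped to 0.0 to avoid division by zero.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to compute a standard deviation")

    scaled_columns = []
    for column in zip(*rows):  # iterate over measures
        col_mean = mean(column)
        col_sd = stdev(column)
        scaled_columns.append(
            [0.0 if col_sd == 0 else (value - col_mean) / col_sd for value in column]
        )

    # Transpose back so each row is again one genome.
    return [list(row) for row in zip(*scaled_columns)]


# Made-up toy table: rows are genomes, columns are two measures on very
# different scales (for example a ratio and a raw count).
toy_measures = [
    [0.10, 4200],
    [0.35, 3900],
    [0.90, 5100],
]
for scaled_row in scale_columns(toy_measures):
    print([round(v, 2) for v in scaled_row])
```

Scaling per column puts measures with different units on a common footing, so that no single measure dominates the heatmap colors.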
