Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860.
doi: 10.1093/nar/gkx1068.

RefSeq: an update on prokaryotic genome annotation and curation

Daniel H Haft et al. Nucleic Acids Res.

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule, BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature and available for download and reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Figures

Figure 1.
New workflow for structural annotation by the PGAP 4.x series pipeline. Computational processes are shown in blue, data in white or gray. GeneMarkS+ provides ab initio prediction of protein-coding genes, guided by hints from homology-based evidence, which for the first time includes HMM evidence. Two steps were first introduced in the PGAP-4.1 release: the use of ORFfinder to produce every stop-to-stop translation, and HMM searching to find every translation with an HMM hit. The pipeline detects both disrupted genes (e.g. pseudogenes) and exceptional reading frames (e.g. selenoproteins).
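
To make the idea of "stop-to-stop" translation concrete, the following minimal sketch enumerates every run of codons between consecutive stop codons in the three forward reading frames. It is an illustration only and is not ORFfinder or PGAP code: the function name `stop_to_stop_segments`, the `min_codons` parameter, the restriction to the forward strand, and the use of the three standard stop codons are all assumptions made for the example.

```python
# Illustrative sketch only; not ORFfinder and not part of PGAP.
# Assumption: standard stop codons, forward strand only.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_to_stop_segments(seq, min_codons=1):
    """Return (frame, start, end) for each run of non-stop codons.

    start is inclusive and end is exclusive, both 0-based nucleotide
    offsets. Runs at the sequence edges are bounded by the sequence
    boundary instead of a stop codon.
    """
    seq = seq.upper()
    segments = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                # Close the current run if it is long enough.
                if (pos - run_start) // 3 >= min_codons:
                    segments.append((frame, run_start, pos))
                run_start = pos + 3
            pos += 3
        # Keep a trailing run that reaches the end of the sequence.
        if (pos - run_start) // 3 >= min_codons:
            segments.append((frame, run_start, pos))
    return segments


if __name__ == "__main__":
    for frame, start, end in stop_to_stop_segments("ATGAAATAGCCCTGA"):
        print(frame, start, end)
```

A real pipeline would also scan the reverse complement and would treat recoded stops, such as the UGA codon in selenoproteins, as special cases; both are omitted here for brevity.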
Figure 2.
A partially expanded view of the homology evidence and protein naming hierarchy used in RefSeq and PGAP annotation. Four families of beta-lactamases are shown (A, metallo, C, and D), each of which is more similar to various hydrolases of other substrates, such as RNA, than to any members of the other beta-lactamase classes. For each class, a protein profile HMM identifies members and suggests a protein product name, but further expansion of the hierarchy can reveal multiple child families, each identified by a more specific HMM that receives a higher precedence during annotation. The hierarchy of evidence largely follows an implicit hierarchy of protein names, with occasional exceptions, as when unrelated proteins perform closely related functions.
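
The precedence rule described in this caption, in which a more specific child HMM outranks its parent, can be sketched as a small tree lookup. This is a hedged illustration and not PGAP's implementation: the `Family` record, the functions `depth` and `pick_product_name`, the family identifiers, and the use of tree depth as a stand-in for precedence are all assumptions introduced for the example.

```python
# Illustrative sketch only; identifiers and names are hypothetical.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Family:
    name: str                     # hypothetical evidence identifier
    product_name: str             # name suggested by this evidence
    parent: Optional[str] = None  # None marks a top-level family


def depth(families: Dict[str, Family], name: str) -> int:
    """Count the ancestors of a family; deeper means more specific."""
    d = 0
    while families[name].parent is not None:
        name = families[name].parent
        d += 1
    return d


def pick_product_name(families: Dict[str, Family],
                      hits: List[str]) -> Optional[str]:
    """Return the product name of the most specific family that hit.

    Assumption: depth in the hierarchy is used as the precedence.
    """
    if not hits:
        return None
    best = max(hits, key=lambda h: depth(families, h))
    return families[best].product_name


if __name__ == "__main__":
    fams = {
        "parent_hmm": Family("parent_hmm", "class A beta-lactamase"),
        "child_hmm": Family("child_hmm",
                            "hypothetical child-family beta-lactamase",
                            parent="parent_hmm"),
    }
    print(pick_product_name(fams, ["parent_hmm", "child_hmm"]))
```

When a protein hits both the class-level model and one of its children, the child's name is chosen; when only the class-level model hits, the broader name is used. The exceptions the caption mentions, where unrelated proteins share a function, would need explicit handling that a pure depth rule does not capture.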
