RefSeq: an update on prokaryotic genome annotation and curation

Daniel H Haft¹, Michael DiCuccio¹, Azat Badretdin¹, Vyacheslav Brover¹, Vyacheslav Chetvernin¹, Kathleen O'Neill¹, Wenjun Li¹, Farideh Chitsaz¹, Myra K Derbyshire¹, Noreen R Gonzales¹, Marc Gwadz¹, Fu Lu¹, Gabriele H Marchler¹, James S Song¹, Narmada Thanki¹, Roxanne A Yamashita¹, Chanjuan Zheng¹, Françoise Thibaud-Nissen¹, Lewis Y Geer¹, Aron Marchler-Bauer¹, Kim D Pruitt¹


PMID: 29112715
PMCID: PMC5753331
DOI: 10.1093/nar/gkx1068


Daniel H Haft et al. Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860. doi: 10.1093/nar/gkx1068.


Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892-6511, USA.


Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule, BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Published by Oxford University Press on behalf of Nucleic Acids Research 2017.


Figures

**Figure 1.**
New workflow for structural annotation by the PGAP 4.x series pipeline. Computational processes are shown in blue, data in white or gray. GeneMarkS+ provides *ab initio* prediction of protein-coding genes, guided by hints from homology-based evidence, which for the first time includes HMM evidence. The use of ORFfinder to produce all stop-to-stop translations, and of HMM searching to find every translation with an HMM hit, are steps first introduced in the PGAP-4.1 release. The pipeline detects both disrupted genes (e.g. pseudogenes) and exceptional reading frames (e.g. selenoproteins).
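As an illustration of what "stop-to-stop translations" means, the following minimal sketch enumerates every translated segment between consecutive stop codons in all six reading frames. It is not ORFfinder or any part of PGAP; the function names, the `min_length` parameter, and the choice to translate unrecognized codons as `X` are assumptions made for this example, and the codon table is the standard one shared by translation table 11 for amino acid assignments.

```python
from itertools import product
from typing import Iterator, Tuple

# Standard codon table in NCBI's TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_to_stop_translations(
    seq: str, min_length: int = 1
) -> Iterator[Tuple[str, int, str]]:
    """Yield (strand, frame, protein) for each run of codons between stops.

    Hypothetical helper for illustration: segments shorter than
    `min_length` residues are skipped, and codons containing
    characters outside ACGT are translated as 'X'.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            residues = []
            for i in range(frame, len(s) - 2, 3):
                aa = CODON_TABLE.get(s[i:i + 3], "X")
                if aa == "*":
                    # A stop codon closes the current segment.
                    if len(residues) >= min_length:
                        yield strand, frame, "".join(residues)
                    residues = []
                else:
                    residues.append(aa)
            # Emit the trailing segment that runs off the sequence end.
            if len(residues) >= min_length:
                yield strand, frame, "".join(residues)


if __name__ == "__main__":
    for strand, frame, protein in stop_to_stop_translations("ATGAAATAACCCTGA"):
        print(strand, frame, protein)
```

Each resulting translation could then be searched against a profile HMM library, which is the role the caption assigns to the HMM search step; that search is outside the scope of this sketch.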

**Figure 2.**
A partially expanded view of the homology evidence and protein naming hierarchy used in RefSeq and PGAP annotation. Four classes of beta-lactamases are shown (A, metallo, C, and D), each of which is more similar to various hydrolases of other substrates, such as RNA, than to any members of the other beta-lactamase classes. For each class, a protein profile HMM identifies members and suggests a protein product name, but further expansion of the hierarchy can reveal multiple child families, each identified by a more specific HMM that receives a higher precedence during annotation. The hierarchy of evidence largely follows an implicit hierarchy of protein names, with occasional exceptions, as when unrelated proteins perform closely related functions.
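The precedence idea in this caption, where a more specific child HMM outranks its parent, can be sketched as follows. This is a simplified illustration, not the PGAP implementation: the model identifiers and the child family name are hypothetical, and depth in the tree is used here as a stand-in for whatever precedence values the real pipeline assigns.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class EvidenceNode:
    """One HMM in a naming hierarchy (all identifiers are hypothetical)."""
    model_id: str
    product_name: str
    parent: Optional[str] = None


def depth(model_id: str, nodes: Dict[str, EvidenceNode]) -> int:
    """Count the ancestors of a node; deeper nodes are more specific."""
    d = 0
    current = nodes[model_id].parent
    while current is not None:
        d += 1
        current = nodes[current].parent
    return d


def pick_product_name(
    hits: Iterable[str], nodes: Dict[str, EvidenceNode]
) -> Optional[str]:
    """Return the name suggested by the most specific HMM among the hits."""
    known = [h for h in hits if h in nodes]
    if not known:
        return None
    best = max(known, key=lambda h: depth(h, nodes))
    return nodes[best].product_name


# Hypothetical two-level hierarchy under one beta-lactamase class.
NODES = {
    n.model_id: n
    for n in (
        EvidenceNode("hmm_class_A", "class A beta-lactamase"),
        EvidenceNode(
            "hmm_class_A_child",
            "class A beta-lactamase, example child family",
            parent="hmm_class_A",
        ),
    )
}

if __name__ == "__main__":
    print(pick_product_name(["hmm_class_A", "hmm_class_A_child"], NODES))
```

A protein that hits only the class-level model receives the class-level name, while one that also hits the child model receives the more specific name.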


