Database (Oxford). 2011 Sep 20;2011:bar034.
doi: 10.1093/database/bar034. Print 2011.

The curation paradigm and application tool used for manual curation of the scientific literature at the Comparative Toxicogenomics Database

Allan Peter Davis et al. Database (Oxford).

Erratum in

  • Database (Oxford). 2012;2012:bas012. Rosenstein, Michael C [added]

Abstract

The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and convert free-text information into a structured format using official nomenclature, integrating third-party controlled vocabularies for chemicals, genes, diseases and organisms, and a novel controlled vocabulary for molecular interactions. Manual curation produces a robust, richly annotated dataset of highly accurate and detailed information. Currently, CTD describes over 349,000 molecular interactions between 6800 chemicals, 20,900 genes (for 330 organisms) and 4300 diseases that have been manually curated from over 25,400 peer-reviewed articles. These manually curated data are further integrated with other third-party data (e.g. Gene Ontology, KEGG and Reactome annotations) to generate a wealth of toxicogenomic relationships. Here, we describe our approach to manual curation that uses a powerful and efficient paradigm involving mnemonic codes. This strategy allows biocurators to quickly capture detailed information from articles by generating simple statements using codes to represent the relationships between data types. The paradigm is versatile, expandable, and able to accommodate new data challenges that arise. We have incorporated this strategy into a web-based curation tool to further increase efficiency and productivity, implement quality control in real time and accommodate biocurators working remotely. Database URL: http://ctd.mdibl.org.

Figures

Figure 1.
CTD data. Biocurators manually curate a triad of core interactions (solid lines) between chemicals (C), genes (G) and diseases (D) from the literature. These data are combined with external annotations from Gene Ontology (GO) and KEGG/Reactome pathways (P) via the shared use of NCBI Gene IDs. A unique feature of CTD is the inferred relationships generated by data integration: if a GO term is annotated to gene G, and independently gene G directly interacts with chemical C (via a curated interaction), then the GO term has an inferred relationship to chemical C (inferred via gene G). Data integration between these five nodes (C, G, D, GO and P) additionally yields novel, inferred relationships (dashed lines). In total, CTD becomes larger and more informative than the sum of its individual curated parts.
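The inference rule in this caption can be sketched as a join on the shared gene. This is a minimal illustration, not CTD's implementation: the function name `infer_chemical_go` and the placeholder identifiers (`C_a`, `G_x`, `GO_1`) are assumptions made up for the example.

```python
from collections import defaultdict


def infer_chemical_go(chem_gene_pairs, gene_go_pairs):
    """Map (chemical, GO term) to the set of genes that support the inference.

    A GO term is inferred for a chemical when some gene is both annotated
    with that GO term and has a curated interaction with the chemical.
    """
    go_by_gene = defaultdict(set)
    for gene, go_term in gene_go_pairs:
        go_by_gene[gene].add(go_term)

    inferred = defaultdict(set)
    for chemical, gene in chem_gene_pairs:
        # .get avoids creating empty entries for genes without GO annotations.
        for go_term in go_by_gene.get(gene, ()):
            inferred[(chemical, go_term)].add(gene)
    return dict(inferred)


# Placeholder data (hypothetical, not taken from CTD).
curated = [("C_a", "G_x"), ("C_a", "G_y"), ("C_b", "G_z")]
annotations = [("G_x", "GO_1"), ("G_y", "GO_1"), ("G_y", "GO_2")]

for (chemical, go_term), genes in sorted(infer_chemical_go(curated, annotations).items()):
    print(chemical, go_term, "inferred via", sorted(genes))
```

Keeping the supporting genes alongside each inferred pair mirrors the caption's phrase "inferred via gene G": the relationship stays traceable to the curated interaction that produced it.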
Figure 2.
Anatomy of an interaction. Biocurators curate data in structured notation (top) by conjoining terms from multiple vocabularies (middle), including the chemical branch of MeSH, 4 chemical qualifiers, 4 action term degrees, 55 action terms, NCBI gene symbols and 16 gene qualifiers. Multiplexing these short lists allows a combinatorially large number of distinct interactions. Here, the biocurator additionally chose bisphenol A for C1 and ESR1 for G1 to complete the interaction. The notation is translated and displayed as a sentence on public CTD (yellow box).
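To give a rough sense of scale, the vocabulary sizes in this caption can be multiplied together. The sketch assumes, purely for illustration, that a simple interaction takes exactly one term from each of the four short lists; the caption does not state this, and real interactions may combine terms differently.

```python
from math import prod

# Vocabulary sizes as stated in the Figure 2 caption.
vocabulary_sizes = {
    "chemical_qualifiers": 4,
    "action_term_degrees": 4,
    "action_terms": 55,
    "gene_qualifiers": 16,
}

# Assumption: one term is chosen from each list, independently of the others.
templates = prod(vocabulary_sizes.values())
print(templates)  # 14080
```

Under that assumption there are 14,080 interaction templates before any specific chemical or gene is chosen, so the number of possible statements grows further with each chemical and gene added to the vocabularies.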
Figure 3.
CTD curation codes. (A) Biocurators use controlled vocabularies and mnemonic codes to construct interactions describing the molecular interaction (increased secretion) between the chemical lipopolysaccharides (C1) and the protein product of the tumor necrosis factor gene (G1/p). (B) The interaction can be expanded using brackets and the reaction code (rxn) to indicate how another chemical inhibits the first interaction. (C) Disease curation captures the relationship between chemicals/genes and a disease. Every interaction is directly associated with a PMID and includes the species in which the interaction was studied. The interactions are translated into sentences (yellow boxes) for users to interpret more easily.
Figure 4.
Curation tool overview. (1) Biocurators submit a PMID to create a ‘PubMed Curation Activity’ page. (2) This page has a hyperlink to the PubMed abstract, which the biocurators use for curation. (3) Based upon the abstract, biocurators can then enter new interactions, edit pre-existing interactions, or clone interactions (to modify any data field to generate a new interaction without having to re-enter all the fields each time). On the ‘Interaction Entry Page’, biocurators construct the interaction using structured notation and mnemonic codes and fill in the necessary data fields. Additional internal data not yet displayed on the public website can also be selected, including: in vivo versus in vitro methods, full-text versus abstract curation (to help with subsequent text-mining evaluations), whether the curation was derived from a high-throughput assay, any type of gene accession ID and curator notes (for any other helpful comment about the curation). (4) When available, the email address of the corresponding author is stored. (5) Additional features allow the biocurator to upload data en masse from an Excel spreadsheet or generate a report of their previously submitted work.
Figure 5.
Detailed view of ‘Interaction Entry Page’. After a biocurator composes a new interaction and tabs out of the cell, the curation tool automatically pops up the required data fields (here, C2, C1, G1 and G2) to correctly complete the interaction. Since ‘Taxon’ is a requirement of all interactions, it is always displayed in the curation tool window, and biocurators can either use a pick-list to select the most commonly entered species or directly type in any species.
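One way to picture how a tool could derive the required data fields from a composed interaction is to scan the notation for slot placeholders. The sketch below is an assumption about the mechanism, not a description of the CTD tool: the regular expression, the function name `required_fields` and the `<code>` stand-ins for mnemonic codes are all hypothetical. Only the slot names (C1, C2, G1, G2), the `rxn` code, the `G1/p` form and the always-present Taxon field come from the captions.

```python
import re

# Hypothetical slot syntax: a letter C (chemical) or G (gene) followed by digits.
SLOT_PATTERN = re.compile(r"\b[CG]\d+\b")


def required_fields(notation):
    """Return slot names in order of first appearance, followed by Taxon."""
    slots = []
    for match in SLOT_PATTERN.finditer(notation):
        if match.group() not in slots:
            slots.append(match.group())
    # Taxon is required for every interaction, so it is always included.
    return slots + ["Taxon"]


# "<code>" stands in for real mnemonic codes, which are not listed here.
print(required_fields("C2 <code> rxn [C1 <code> G1 <code> G2]"))
# ['C2', 'C1', 'G1', 'G2', 'Taxon']
```

Deriving the fields from the notation itself means the form can never ask for a slot the interaction does not use, or omit one that it does.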
Figure 6.
Color-coded QC. (A) If an invalid curation code (here, ‘sce’) is entered in the interaction field (Ixn), the tool automatically alerts the biocurator by coloring the window red (‘STOP’) and producing an error report at the bottom of the page (red circle). The interaction cannot be saved until the biocurator fixes the error. Notice that the terms for C1, G1 and Taxon are correctly entered and the fields remain green. (B) Terms entered by a biocurator for chemicals (C1), genes (G1, G2), diseases (data not shown) and Taxon are automatically compared against CTD’s controlled vocabularies and are color-coded according to their correspondence. Here, the C1 term Arsenic is an acceptable official term and is highlighted in green. The G1 term TIKI, however, does not match any official gene symbol or synonym in CTD, so the curation tool alerts the biocurator in red. The G2 term BOB does not match any official gene symbol in CTD, but is a synonym for more than one gene; since the tool cannot deduce which official symbol was intended, the term is flagged in purple for the biocurator to resolve. In Taxon, by contrast, the biocurator originally entered ‘Dog’ and the curation tool was able to resolve it as a synonym for just one official term; the tool automatically replaces ‘Dog’ with that term (Canis lupus familiaris) but still cautions the biocurator to double-check the automatic selection.
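The four outcomes described in panel (B) amount to a small decision procedure, sketched below. This is an illustration under stated assumptions rather than the tool's actual logic: matching is assumed to be case-insensitive, the gene symbols `GENE_A` and `GENE_B` are hypothetical stand-ins for the genes that share the synonym BOB, and the status name `caution` is invented because the caption does not name a color for the auto-replaced case.

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    status: str   # "green", "red", "purple" or "caution"
    term: str     # term to keep in the field
    message: str


def resolve_term(entered, official_terms, synonyms):
    """Check an entered term against official terms and a synonym table.

    synonyms maps a lowercase synonym to the set of official terms it names.
    """
    key = entered.strip().lower()
    official_by_key = {term.lower(): term for term in official_terms}
    if key in official_by_key:
        return Resolution("green", official_by_key[key], "official term")

    candidates = synonyms.get(key, set())
    if len(candidates) == 1:
        (official,) = candidates
        return Resolution("caution", official, "synonym replaced; please double-check")
    if len(candidates) > 1:
        options = ", ".join(sorted(candidates))
        return Resolution("purple", entered, "ambiguous synonym; choose one of: " + options)
    return Resolution("red", entered, "no matching official term or synonym")


# Sample vocabularies; GENE_A and GENE_B are hypothetical placeholders.
CHEMICALS = {"Arsenic"}
GENES = {"GENE_A", "GENE_B"}
GENE_SYNONYMS = {"bob": {"GENE_A", "GENE_B"}}
TAXA = {"Canis lupus familiaris"}
TAXON_SYNONYMS = {"dog": {"Canis lupus familiaris"}}

print(resolve_term("Arsenic", CHEMICALS, {}))
print(resolve_term("TIKI", GENES, GENE_SYNONYMS))
print(resolve_term("BOB", GENES, GENE_SYNONYMS))
print(resolve_term("Dog", TAXA, TAXON_SYNONYMS))
```

Checking official terms before synonyms ensures that a string which is both an official term and a synonym of something else is accepted as entered.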
Figure 7.
CTD/PostgreSQL logical database architecture. CTD logically comprises three major databases: Curation Database (yellow), 3rd Party Database (green) and Public Web Application (PWA) Database (blue). Biocurators, via the web, submit manually curated interactions and information that end up in the Curation Database. The 3rd Party Database contains data extracted from external sources (e.g. NCBI, GO, MeSH and OMIM). The PWA Database is loaded monthly and integrates the Curation Database and the 3rd Party Database; it is designed as a high-speed reporting database with selective denormalizations and data rollups. The PWA Database also contains novel, associative data (e.g. calculations for inference scores, enrichment scores and Jaccard indexing). Users access CTD via the PWA.
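For readers unfamiliar with the Jaccard index mentioned above, it is the size of the intersection of two sets divided by the size of their union. The caption does not say which sets CTD compares; the sketch below assumes, for illustration only, the sets of genes that interact with two chemicals, with placeholder gene names.

```python
def jaccard_index(set_a, set_b):
    """Return |A intersect B| / |A union B|, or 0.0 when both sets are empty."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical gene sets for two chemicals (placeholder names).
genes_for_chemical_1 = {"G1", "G2", "G3"}
genes_for_chemical_2 = {"G2", "G3", "G4"}

print(jaccard_index(genes_for_chemical_1, genes_for_chemical_2))  # 0.5
```

Here two genes are shared out of four distinct genes overall, giving 0.5; identical sets score 1.0 and disjoint sets score 0.0.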
