Database (Oxford). 2012 Dec 6:2012:bas051.
doi: 10.1093/database/bas051. Print 2012.

Targeted journal curation as a method to improve data currency at the Comparative Toxicogenomics Database

Allan Peter Davis et al. Database (Oxford).

Abstract

The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and manually curate a triad of chemical-gene, chemical-disease and gene-disease interactions. Typically, articles for CTD are selected using a chemical-centric approach by querying PubMed to retrieve a corpus containing the chemical of interest. Although this technique ensures adequate coverage of knowledge about the chemical (i.e. data completeness), it does not necessarily reflect the most current state of all toxicological research in the community at large (i.e. data currency). Keeping databases current with the most recent scientific results, as well as providing a rich historical background from legacy articles, is a challenging process. To address this issue of data currency, CTD designed and tested a journal-centric approach of curation to complement our chemical-centric method. We first identified priority journals based on defined criteria. Next, over 7 weeks, three biocurators reviewed 2425 articles from three consecutive years (2009-2011) of three targeted journals. From this corpus, 1252 articles contained relevant data for CTD and 52 752 interactions were manually curated. Here, we describe our journal selection process, two methods of document delivery for the biocurators and the analysis of the resulting curation metrics, including data currency, and both intra-journal and inter-journal comparisons of research topics. Based on our results, we expect that curation by select journals can (i) be easily incorporated into the curation pipeline to complement our chemical-centric approach; (ii) build content more evenly for chemicals, genes and diseases in CTD (rather than biasing data by chemicals-of-interest); (iii) reflect developing areas in environmental health and (iv) improve overall data currency for chemicals, genes and diseases. 
Database URL: http://ctdbase.org/

Figures

Figure 1
Data currency at CTD. In March 2012, CTD contained 88 035 articles published between 1946 and 2012, including 46 113 (52%) legacy articles (grey), 36 900 (42%) contemporary articles (blue) and 5022 (6%) current articles (red); for simplicity, the numbers of articles for publication years 1946–66 were condensed into a single bar. When the number of curated articles in CTD is compared against an approximate number of available toxicogenomic articles from PubMed (solid black line), a noticeable hypothetical minimum gap in data currency is seen, especially for years 2010–11 (dashed lines). To approximate the number of hypothetical toxicogenomic articles for each year, PubMed was queried with the generic string: (toxicology OR toxicogenomics) OR [chemical AND (gene OR mRNA OR transcript)] NOT review[pt] AND ("YYYY/01/01"[PPDAT]:"YYYY/12/31"[PPDAT]), where YYYY = year of interest. The retrieved background is clearly an underrepresentation of the possible available literature (perhaps by as much as 2-fold (1)); thus, the gap in data currency is described as a ‘minimum gap’.
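The per-year query described in this caption can be generated mechanically. The following is a minimal sketch that only builds the query strings; it does not contact PubMed, and the function name build_query is a hypothetical helper introduced here for illustration. The query text is copied verbatim from the caption (including its square brackets) and has not been validated against PubMed's current search syntax.

```python
def build_query(year: int) -> str:
    """Fill one publication year into the generic query string from the caption."""
    # Topic clause, reproduced verbatim from the Figure 1 caption.
    topic = "(toxicology OR toxicogenomics) OR [chemical AND (gene OR mRNA OR transcript)]"
    # Date clause: YYYY in the caption is replaced by the year of interest.
    date_range = f'("{year}/01/01"[PPDAT]:"{year}/12/31"[PPDAT])'
    return f"{topic} NOT review[pt] AND {date_range}"


# One query per targeted publication year (2009-2011).
queries = {year: build_query(year) for year in range(2009, 2012)}
for year, query in queries.items():
    print(year, query)
```

Each resulting string would be submitted once per year of interest, and the returned hit count would give the approximate background plotted as the solid black line.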
Figure 2
Intra-journal data comparison for 2009–11. Nine Venn diagrams depict the overlapping datasets for the number of chemicals, genes and diseases for each journal for publication years 2009 (blue circles), 2010 (green circles) and 2011 (red circles). Yellow boxes provide examples of shared elements for all 3 years in the centre intersection of each Venn diagram and are described in the main text. TS = Toxicological Sciences, CBI = Chemico-Biological Interactions and EHP = Environmental Health Perspectives. All data are provided in the Supplementary Data, and readers can use CTD’s ‘MyVenn’ tool (http://ctdbase.org/tools/myVenn.go) to re-draw the Venn diagrams to explore all the sets.
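The three-year overlap in each Venn diagram amounts to plain set arithmetic. The sketch below splits three yearly sets into the seven regions of a three-circle diagram; the function name venn_regions and the item labels (chem_A and so on) are made-up placeholders, not CTD data, and this is not the implementation behind the MyVenn tool.

```python
def venn_regions(y2009, y2010, y2011):
    """Split three sets into the seven regions of a three-circle Venn diagram."""
    a, b, c = set(y2009), set(y2010), set(y2011)
    return {
        "all three years": a & b & c,        # centre intersection
        "2009 and 2010 only": (a & b) - c,
        "2009 and 2011 only": (a & c) - b,
        "2010 and 2011 only": (b & c) - a,
        "2009 only": a - b - c,
        "2010 only": b - a - c,
        "2011 only": c - a - b,
    }


# Placeholder labels for illustration only (not curated CTD content).
regions = venn_regions(
    {"chem_A", "chem_B", "chem_C"},
    {"chem_A", "chem_B", "chem_D"},
    {"chem_A", "chem_E"},
)
for name, members in regions.items():
    print(f"{name}: {sorted(members)}")
```

The "all three years" region corresponds to the centre intersection highlighted by the yellow boxes in the figure.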
Figure 3
Prominent environmental chemicals from inter-journal comparison. Three Venn diagrams depict the overlapping datasets for curated chemicals shared by journals TS (purple circles), CBI (black circles) and EHP (orange circles) for years 2009–11. The first three chemicals in each list (blue) are shared by all three journals for all 3 years, and nine chemicals (green) are shared in 2 of the 3 years. The other listed chemicals (black) are shared by the three journals for that unique year. Seven chemicals (red checks) are known to modulate sex hormone receptor signalling pathways. All data are provided in the Supplementary Data, and readers can use CTD’s ‘MyVenn’ tool (http://ctdbase.org/tools/myVenn.go) to re-draw the Venn diagrams to explore all the sets.
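The grouping used in this caption (shared by all three journals in all 3 years, in 2 of the 3 years, or in a single year) can be sketched as a two-step count: intersect the journals within each year, then count how many years each item survives. The function name years_shared_by_all, the nested-dictionary layout and the item labels are assumptions made for illustration; none of them are taken from CTD.

```python
from collections import Counter


def years_shared_by_all(curated):
    """Count, per item, the years in which every journal curated it.

    `curated` maps year -> journal -> set of items (hypothetical layout).
    """
    counts = Counter()
    for by_journal in curated.values():
        # Items present in every journal for this year.
        shared = set.intersection(*(set(items) for items in by_journal.values()))
        counts.update(shared)
    return counts


# Placeholder labels for illustration only (not curated CTD content).
curated = {
    2009: {"TS": {"chem_A", "chem_B", "chem_C"},
           "CBI": {"chem_A", "chem_B"},
           "EHP": {"chem_A", "chem_B", "chem_D"}},
    2010: {"TS": {"chem_A", "chem_B"},
           "CBI": {"chem_A", "chem_B", "chem_E"},
           "EHP": {"chem_A", "chem_B"}},
    2011: {"TS": {"chem_A", "chem_F"},
           "CBI": {"chem_A", "chem_F"},
           "EHP": {"chem_A", "chem_F"}},
}
counts = years_shared_by_all(curated)
for item, n_years in sorted(counts.items()):
    print(f"{item}: shared by all journals in {n_years} of 3 years")
```

Under this scheme, a count of 3 corresponds to the blue entries in the figure, a count of 2 to the green entries and a count of 1 to the black entries.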
Figure 4
Trending toxicology gene sets from inter-journal comparison. Three Venn diagrams depict the overlapping datasets for curated genes shared by journals TS (purple circles), CBI (black circles) and EHP (orange circles) for years 2009–11. Fifteen genes (blue) are shared by all three journals for all 3 years, and 30 other genes (green) are shared in 2 of the 3 years. The additional genes specific for each individual year are not shown but listed as 59 (for 2009), 51 (for 2010) and 41 (for 2011). All data are provided in the Supplementary Data, and readers can use CTD’s ‘MyVenn’ tool (http://ctdbase.org/tools/myVenn.go) to re-draw the Venn diagrams to explore all the sets.
Figure 5
Environmental diseases from inter-journal comparison. Three Venn diagrams depict the overlapping datasets for curated diseases shared by journals TS (purple circles), CBI (black circles) and EHP (orange circles) for years 2009–11. Inflammation (blue) is shared by all three journals for all 3 years, and experimental neoplasms and seizures (green) are shared in 2 of the 3 years. The other listed diseases (black) are shared by the three journals for that unique year. In 2011, two pre-diabetes markers (red checks) are shared among all three journals. All data are provided in the Supplementary Data, and readers can use CTD’s ‘MyVenn’ tool (http://ctdbase.org/tools/myVenn.go) to re-draw the Venn diagrams to explore all the sets.
Figure 6
CTD’s two complementary processes for literature selection and curation. In the chemical-centric approach, each month we select several chemicals-of-interest from our Chemical Priority Matrix to query PubMed for all the literature (both current and legacy) for each chemical. Depending upon the size of the corpus, either all the abstracts are sent to the biocurator, or they are first processed through CTD’s text-mining algorithm to rank and prioritize the papers based upon data content. This approach results in data completeness for the chemical. In the journal-centric approach, we could retrieve the complete set of articles for selected targeted journals on a regular basis (perhaps semi-annually), providing a corpus of research papers that more accurately reflects the current state of toxicogenomics, regardless of any chemical bias. This method results in improved overall data currency at CTD.
Figure 7
Expanding targeted journal curation at CTD. From March to August 2012, 9631 new articles were added to CTD. Of these, 4254 are from targeted journal curation, including 1252 from the three journals (TS, CBI and EHP) reported here (yellow bars) plus 3002 articles from nine additional journals (green bars) for publication years from 2009 to the first half of 2012. The remaining 5377 articles (black bars) are from other CTD projects and span publication years 1962–2012.
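The counts in this caption can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in the caption; the variable names are introduced here for illustration.

```python
# Article counts as stated in the Figure 7 caption.
total_new = 9631          # new articles added March-August 2012
three_journals = 1252     # TS, CBI and EHP (yellow bars)
nine_journals = 3002      # nine additional journals (green bars)
other_projects = 5377     # other CTD projects (black bars)

# Targeted journal curation is the sum of the two journal groups.
targeted = three_journals + nine_journals

# The targeted and non-targeted articles should account for every new article.
accounted_for = targeted + other_projects

# Share of new articles that came from targeted journal curation.
targeted_share = targeted / total_new
print(f"targeted: {targeted}, accounted for: {accounted_for}, share: {targeted_share:.1%}")
```

The two sums reproduce the 4254 targeted articles and the 9631 total given in the caption.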

References

    1. Davis AP, King BL, Mockus S, et al. The Comparative Toxicogenomics Database: update 2011. Nucleic Acids Res. 2011;39:D1067–D1072.
    2. Gohlke JM, Thomas R, Zhang Y, et al. Genetic and environmental pathways to complex diseases. BMC Syst. Biol. 2009;3:46.
    3. Davis AP, Murphy CG, Saraceni-Richards CA, et al. Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical–gene–disease networks. Nucleic Acids Res. 2009;37:D786–D792.
    4. Davis AP, Murphy CG, Rosenstein MC, et al. The Comparative Toxicogenomics Database facilitates identification and understanding of chemical–gene–disease associations: arsenic as a case study. BMC Med. Genomics. 2008;1:48.
    5. Davis AP, Wiegers TC, Murphy CG, et al. The curation paradigm and application tool used for manual curation of the scientific literature at the Comparative Toxicogenomics Database. Database. 2011;Vol. 2011. doi:10.1093/database/bar034.