2012 Jun 8:2012:bas026.
doi: 10.1093/database/bas026. Print 2012.

Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE

Aurélie Névéol et al. Database (Oxford).

Abstract

High-throughput experiments and bioinformatics techniques are creating an exploding volume of data that are becoming overwhelming to keep track of for biologists and researchers who need to access, analyze and process existing data. Much of the available data are being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates. Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature mainly relies on manual labour, which makes it a time-consuming and daunting task. Herein, we analysed the current state of link curation between GEO, PDB and MEDLINE. We found that the link curation is heterogeneous depending on the sources and databases involved, and that overlap between sources is low, <50% for PDB and GEO. Furthermore, we showed that text-mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases while maintaining high-quality curation.

Database URLs: http://www.ncbi.nlm.nih.gov/PubMed, http://www.ncbi.nlm.nih.gov/geo/, http://www.rcsb.org/pdb/
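As a rough illustration of the kind of evidence a text-mining step could surface for curators, the sketch below scans a sentence for GEO-style accession numbers. It is not the tool evaluated in the paper; the function name `find_geo_accessions` and the sample sentence are hypothetical, and the pattern only assumes that GEO accessions consist of one of the prefixes GSE, GSM, GPL or GDS followed by digits.

```python
import re

# Assumption: GEO accessions are a prefix (GSE, GSM, GPL or GDS) followed by digits.
GEO_ACCESSION = re.compile(r"\b(?:GSE|GSM|GPL|GDS)\d+\b")


def find_geo_accessions(text):
    """Return unique GEO accession-like tokens in order of first appearance."""
    found = []
    for token in GEO_ACCESSION.findall(text):
        if token not in found:
            found.append(token)
    return found


# Hypothetical sentence of the kind found in a full-text article.
sentence = "The microarray data were deposited in GEO under accession GSE12345."
print(find_geo_accessions(sentence))
```

A pattern match of this kind only flags candidate mentions; a curator (or a trained classifier) would still need to decide whether the sentence actually reports a data deposition.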

Figures

Figure 1. Excerpt from a sample GEO SOFT file.
Figure 2. Excerpt from a sample MEDLINE citation.
Figure 3. Overlap between GEO, MEDLINE (SI) and the results of text-mining on PMC; evidence statements were extracted from full-text articles for three categories that were outside the consensus between link sources: (I) articles curated in MEDLINE but not by GEO, (II) articles curated by GEO but not by MEDLINE and (III) articles curated neither by MEDLINE nor GEO, but identified as relevant for link curation for GEO by our automatic tool.
Figure 4. Overlap between PDB, MEDLINE (SI) and the results of text-mining on PMC; evidence statements were extracted from full-text articles for three categories that were outside the consensus between link sources: (I) articles curated in MEDLINE but not by PDB, (II) articles curated by PDB but not by MEDLINE and (III) articles curated neither by MEDLINE nor PDB, but identified as relevant for link curation for PDB by our automatic tool.
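The three categories named in the captions for Figures 3 and 4 can be read as plain set differences over article identifiers. The sketch below is a minimal illustration under that reading, not the authors' implementation; the function name `partition_links` and the identifiers are hypothetical and carry no real data.

```python
def partition_links(medline, database, text_mined):
    """Split article IDs into the consensus set and categories I to III."""
    return {
        # Curated by both MEDLINE and the database (GEO or PDB)
        "consensus": medline & database,
        # (I) curated in MEDLINE but not by the database
        "I": medline - database,
        # (II) curated by the database but not in MEDLINE
        "II": database - medline,
        # (III) curated by neither, but flagged by the text-mining tool
        "III": text_mined - (medline | database),
    }


# Hypothetical article identifiers, for illustration only.
medline_links = {"a1", "a2", "a3"}
geo_links = {"a2", "a3", "a4"}
mined_links = {"a3", "a4", "a5"}

categories = partition_links(medline_links, geo_links, mined_links)
for name, ids in categories.items():
    print(name, sorted(ids))
```

With these toy inputs, the consensus holds the two articles both sources agree on, and each of the three categories holds one article.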

