Bioinformatics. 2011 Dec 1;27(23):3306-12.
doi: 10.1093/bioinformatics/btr573. Epub 2011 Oct 13.

Extraction of data deposition statements from the literature: a method for automatically tracking research results

Aurélie Névéol et al. Bioinformatics. 2011.

Abstract

Motivation: Research in the biomedical domain can have a major impact through open sharing of the data produced. For this reason, it is important to be able to identify instances of data production and deposition for potential re-use. Herein, we report on the automatic identification of data deposition statements in research articles.

Results: We apply machine learning algorithms to sentences extracted from full-text articles in PubMed Central in order to automatically determine whether a given article contains a data deposition statement, and to retrieve the specific statements. With a Support Vector Machine classifier using deposition features determined by a conditional random field, articles containing deposition statements are correctly identified with 81% F-measure. An error analysis shows that almost half of the articles classified as containing a deposition statement by our method but not by the gold standard do indeed contain a deposition statement. In addition, our system was used to process articles in PubMed Central, predicting that a total of 52 932 articles report data deposition, many of which are not currently included in the Secondary Source Identifier [si] field for MEDLINE citations.
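As a minimal sketch of how an article-level F-measure such as the one reported above could be computed, the snippet below compares a set of predicted article identifiers against a gold-standard set. It assumes the balanced F-measure (harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` and the article identifiers are hypothetical and are not taken from the study's data.

```python
from typing import Set


def f_measure(predicted: Set[str], gold: Set[str]) -> float:
    """Balanced F-measure over article IDs (assumed definition, see lead-in)."""
    true_pos = len(predicted & gold)  # articles flagged by both system and gold standard
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)  # share of flagged articles that are correct
    recall = true_pos / len(gold)          # share of gold articles that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: IDs of articles judged to contain a deposition statement.
predicted = {"a1", "a2", "a3", "a4"}
gold = {"a1", "a2", "a3", "a5"}
score = f_measure(predicted, gold)
print(f"F-measure: {score:.2f}")
```

Note that an article counted as a false positive here (such as `a4`) may still contain a genuine deposition statement that the gold standard missed, which is the situation the error analysis describes.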

Availability: All annotated datasets described in this study are freely available from the NLM/NCBI website at http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Neveol/DepositionDataSets.zip

Contact: aurelie.neveol@nih.gov; john.wilbur@nih.gov; zhiyong.lu@nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1. Overview of annotated datasets used in this work.
Fig. 2. Precision/recall curves for SVM and NB models built using all features.
