PLoS Comput Biol. 2018 Feb 15;14(2):e1005962.
doi: 10.1371/journal.pcbi.1005962. eCollection 2018 Feb.

A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

David Westergaard et al. PLoS Comput Biol.

Abstract

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, because abstracts are readily available. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 200 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms text mining of abstracts alone.
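
To make the extraction step concrete, the following is a minimal sketch of dictionary-based entity matching followed by sentence-level co-occurrence counting. It is an illustration only: the abstract does not describe the internals of the named entity recognition system, so the tiny dictionaries (`GENES`, `DISEASES`), the function names, and the sentence-level co-occurrence rule are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy dictionaries; a real system would rely on curated lexicons.
GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "leukemia"}


def find_entities(sentence, dictionary):
    """Return the dictionary terms that occur as whole words in the sentence."""
    lowered = sentence.lower()
    return {
        term
        for term in dictionary
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    }


def cooccurrences(text):
    """Count gene-disease pairs that are mentioned in the same sentence."""
    counts = Counter()
    # Naive sentence splitting on end punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = find_entities(sentence, GENES)
        diseases = find_entities(sentence, DISEASES)
        counts.update(product(sorted(genes), sorted(diseases)))
    return counts


text = (
    "BRCA1 mutations are linked to breast cancer. "
    "TP53 is altered in leukemia and in breast cancer. "
    "BRCA1 was also studied."
)
pairs = cooccurrences(text)
for (gene, disease), n in sorted(pairs.items()):
    print(gene, disease, n)
```

Pair counts of this kind could then be turned into association scores and compared against a gold standard, which is the sort of evaluation the abstract reports.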

Conflict of interest statement

SB and LJJ are on the scientific advisory board of Intomics A/S, were among its founders, and hold equity in the company.

Figures

Fig 1. Temporal corpus statistics derived from articles passing the pre-processing.
(a) Number of publications per year in the period 1823–2016. The full-text corpus encompasses both the PMC and TDM corpus. The growth in publications was found to fit an exponential model. (b) Temporal development in the distribution of six different topical categories in the period 1823–2016. Publications from health science journals made up nearly 75% of all publications until 1950, at which point their share started to decrease rapidly. To date, they make up approximately 25% of the publications in the full-text corpus. (c) Development in the number of pages per article in the period 1823–2016. Article length ranges from 1 to 1,572 pages. Until 1900 the number of one-page articles was increasing, at one point making up 75% of all articles. At the end of the 19th century, the number of one-page articles started to decrease, and by the start of the 21st century they made up less than 20%. Conversely, the number of articles with 11+ pages has been increasing, and by the start of the 21st century they made up more than 20% of all articles.
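
As an illustration of what fitting an exponential growth model to yearly publication counts can look like, here is a minimal sketch that uses ordinary least squares on the logarithm of the counts. The fitting procedure is an assumption, since the caption does not say how the model was fitted, and the data below are synthetic and are not taken from the article.

```python
import math


def fit_exponential(years, counts):
    """Fit counts ~ a * exp(b * (year - years[0])) by least squares on log(counts)."""
    xs = [y - years[0] for y in years]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # growth rate per year
    a = math.exp(mean_y - b * mean_x)  # fitted count in the first year
    return a, b


# Synthetic example: 100 publications in the first year, 5% continuous growth.
years = list(range(2000, 2010))
counts = [100 * math.exp(0.05 * (y - 2000)) for y in years]

a, b = fit_exponential(years, counts)
doubling_time = math.log(2) / b  # years needed for the yearly output to double
print(round(a, 3), round(b, 3), round(doubling_time, 1))
```

A log-linear fit weights relative errors equally across years; a nonlinear fit on the raw counts would instead be dominated by the most recent, largest values.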
Fig 2. Benchmarking the four different corpora.
In all cases the AUC is far greater than 0.5, indicating that the results obtained are better than random. The biggest gain in AUC is seen for disease-gene associations (a), followed by protein-compartment associations (c) and protein-protein associations (b).
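
For readers unfamiliar with the metric, the sketch below computes the area under the ROC curve (AUC) as the probability that a randomly chosen true association is scored above a randomly chosen false one, with ties counted as one half. An AUC of 0.5 corresponds to random ranking. The function name, scores, and labels are made up for illustration and are not the article's benchmark data.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 1 marks a pair found in the gold standard, 0 a pair that is not.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

The pairwise formulation is quadratic in the number of examples, which is fine for a small illustration; a rank-based computation would be preferable for large benchmarks.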
Fig 3. Benchmarking the four different corpora at low false positive rates.
At a false positive rate of 10%, relevant to practical applications, the full-text corpus still outperforms the collection of MEDLINE abstracts for the extraction of (a) disease-gene associations. Conversely, the performance is the same for (b) protein-protein associations and (c) protein-compartment associations.
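
One way to compare methods at a fixed false positive rate is to report the highest true positive rate that any score threshold achieves without exceeding that rate. The sketch below does this for a 10% cutoff. The function name and the toy scores are assumptions for illustration, and this is not necessarily how the article computed its low-false-positive comparison.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.10):
    """Highest true positive rate of any score threshold whose FPR is at most max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    best = 0.0
    # Try every distinct score as a threshold: predict positive when score >= t.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


# Toy example with 4 positives and 10 negatives (values are made up).
pos_scores = [0.95, 0.9, 0.6, 0.3]
neg_scores = [0.8, 0.5, 0.45, 0.4, 0.35, 0.25, 0.2, 0.15, 0.1, 0.05]
scores = pos_scores + neg_scores
labels = [1] * len(pos_scores) + [0] * len(neg_scores)

result = tpr_at_fpr(scores, labels, max_fpr=0.10)
print(result)
```

With ten negatives, a 10% false positive rate allows at most one false positive, so the threshold can be lowered only until the second negative would be admitted.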
