PLoS Comput Biol. 2018 Feb 15;14(2):e1005962.

doi: 10.1371/journal.pcbi.1005962. eCollection 2018 Feb.

A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

David Westergaard¹,², Hans-Henrik Stærfeldt¹, Christian Tønsberg³, Lars Juhl Jensen², Søren Brunak¹

Affiliations

¹ Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark.
² Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.
³ Office for Innovation and Sector Services, Technical Information Center of Denmark, Technical University of Denmark, Lyngby, Denmark.

PMID: 29447159
PMCID: PMC5831415
DOI: 10.1371/journal.pcbi.1005962

David Westergaard et al. PLoS Comput Biol. 2018.

Abstract

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 200 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

Conflict of interest statement

SB and LJJ are on the scientific advisory board and have been among the founders of Intomics A/S with equity in the company.

Figures

**Fig 1. Temporal corpus statistics derived from articles passing the pre-processing.**
**(a)** Number of publications per year in the period 1823–2016. The full-text corpus encompasses both the PMC and TDM corpora. The growth in publications was found to fit an exponential model. **(b)** Temporal development in the distribution of six different topical categories in the period 1823–2016. Publications from health science journals made up nearly 75% of all publications until 1950, after which their share started to decrease rapidly. To date, they make up approximately 25% of the publications in the full-text corpus. **(c)** Development in the number of pages per article in the period 1823–2016. Article length ranges from 1 to 1,572 pages. Until 1900, the proportion of one-page articles was increasing, at one point making up 75% of all articles. At the end of the 19th century, the proportion of one-page articles started to decrease, and by the start of the 21st century they made up less than 20%. Conversely, the proportion of articles with 11+ pages has been increasing, and by the start of the 21st century made up more than 20% of all articles.

**Fig 2. Benchmarking the four different corpora.**
In all cases the AUC is far greater than 0.5, indicating that the results obtained are better than random. The biggest gain in AUC is seen for disease-gene associations **(a)**, followed by protein-compartment associations **(c)** and protein-protein associations **(b)**.

**Fig 3. Benchmarking the four different corpora at low false positive rates.**
At a false positive rate of 10%, relevant to practical applications, the full-text corpus still outperforms the collection of MEDLINE abstracts for the extraction of **(a)** disease-gene associations. Conversely, the performance is the same for **(b)** protein-protein associations and **(c)** protein-compartment associations.
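
The two figure captions above refer to the area under the ROC curve (AUC) and to performance at a false positive rate of 10%. As a rough illustration of what those quantities measure, here is a minimal sketch in plain Python. It is not the authors' benchmarking pipeline: the function names `roc_points` and `roc_area` are hypothetical, and the scores and labels are invented toy values rather than data from the study.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points from sweeping a threshold from high to low score.

    labels use 1 for a gold-standard positive pair and 0 for a negative pair.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all items sharing the same score so ties yield one segment.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def roc_area(points: List[Tuple[float, float]], max_fpr: float = 1.0) -> float:
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr] (not normalised)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the curve at the FPR cut-off.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example (invented values): two positives and two negatives, interleaved.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_curve = roc_points(toy_scores, toy_labels)
full_auc = roc_area(toy_curve)          # area over the whole FPR range
partial_auc = roc_area(toy_curve, 0.1)  # area restricted to FPR <= 10%
```

An AUC of 0.5 corresponds to a ranking no better than chance, which is why the captions compare against that value. Restricting the area to a low false positive rate focuses on the top of the ranking, the region that matters most when only high-confidence associations are used.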
