Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature

Tim Beck¹ ², Tom Shorter¹, Yan Hu³ ⁴, Zhuoyu Li³, Shujian Sun³, Casiana M Popovici³ ⁴, Nicholas A R McQuibban³ ⁵, Filip Makraduli³, Cheng S Yeung³, Thomas Rowlands¹, Joram M Posma² ³

Affiliations

¹ Department of Genetics and Genome Biology, University of Leicester, Leicester, United Kingdom.
² Health Data Research UK (HDR UK), London, United Kingdom.
³ Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom.
⁴ Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
⁵ Centre for Integrative Systems Biology and Bioinformatics (CISBIO), Department of Life Sciences, Imperial College London, London, United Kingdom.

PMID: 35243479
PMCID: PMC8885717
DOI: 10.3389/fdgth.2022.788124

Tim Beck et al. Front Digit Health. 2022 Feb 15:4:788124. doi: 10.3389/fdgth.2022.788124. eCollection 2022.

Abstract

To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a simple, community-driven data structure for sharing text and annotations; however, there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogeneous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to the machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates each abbreviation to its full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.

Keywords: biomedical literature; health data; natural language processing; semantics; text mining.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
An extract of the Auto-CORPus BioC JSON created from the PMC3606015 full-text HTML file. Each section is annotated with IAO terms. The “autocorpus_fulltext.key” file describes the contents of the full-text JSON file (https://github.com/omicsNLP/Auto-CORPus/blob/main/keyFiles/autocorpus_fulltext.key).

**Figure 2**
An extract from the Auto-CORPus abbreviations JSON created from the PMC4068805 full-text HTML file. For each abbreviation, the corresponding long-form definition is given along with the algorithm(s) used to detect the abbreviation. Most of the abbreviations shown were independently identified both in the full-text and in the abbreviations section of the publication. A variation in the definition of “RP” was detected: in the abbreviations section it was defined as “reverse phase,” whereas in the full-text it was defined as “reversed phase.” The “autocorpus_abbreviations.key” file describes the contents of the abbreviations JSON file (https://github.com/omicsNLP/Auto-CORPus/blob/main/keyFiles/autocorpus_abbreviations.key).
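
The idea of recording which detection source found each definition, and flagging abbreviations whose definitions disagree, can be illustrated with a minimal Python sketch. This is not the Auto-CORPus implementation or its JSON schema: the function names, the source labels (`fulltext`, `abbreviations_section`) and the "MS" entry are assumptions made for illustration; only the two "RP" definitions come from the caption above.

```python
from collections import defaultdict


def merge_abbreviations(sources):
    """Merge abbreviation/definition pairs found by several detection sources.

    `sources` maps a (hypothetical) source name to {abbreviation: definition}.
    Returns {abbreviation: {definition: [source names that reported it]}}.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for source_name, pairs in sources.items():
        for abbreviation, definition in pairs.items():
            merged[abbreviation][definition].append(source_name)
    # Convert nested defaultdicts to plain dicts for easy JSON serialisation.
    return {abbr: dict(defs) for abbr, defs in merged.items()}


def conflicting(merged):
    """Return abbreviations that have more than one distinct definition."""
    return sorted(abbr for abbr, defs in merged.items() if len(defs) > 1)


# Illustrative input: the "RP" variants are from the figure caption,
# the "MS" entry is a made-up example of an agreeing pair.
sources = {
    "fulltext": {"RP": "reversed phase", "MS": "mass spectrometry"},
    "abbreviations_section": {"RP": "reverse phase", "MS": "mass spectrometry"},
}
merged = merge_abbreviations(sources)
variants = conflicting(merged)
```

Keeping each definition as its own key makes a variation such as the two "RP" forms visible without discarding either one.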

**Figure 3**
Flow diagram demonstrating the process of classifying publication sections with IAO terms. The unfiltered digraph is visualized in Supplementary Figure 1, and the process of combining DPGs and mapping unmapped nodes using anchor points in Supplementary Figure 2. DPG, directed path graph; G(V,E), graph(vertex, edge); IAO, information artifact ontology.

**Figure 4**
Unmapped nodes in the digraph (Figure 3) connected to “abstract” as ego node, excluding corpus specific nodes, grouped into different categories. Unlabeled nodes are titles of paragraphs in the main text.

**Figure 5**
Final digraph model used in Auto-CORPus to classify paragraphs after fuzzy matching to IAO terms (v2020-06-10). This model includes new (proposed) section terms, and each section contains new synonyms identified in this analysis. “Associated Data” is included because it is a PMC-specific header found before abstracts and can be used to indicate the start of most articles. All IAO terms are indicated in orange.
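
Fuzzy matching of a section header against a list of known synonyms can be sketched as follows. This is a simplified stand-in, not the digraph model described in the caption: it uses the similarity ratio from Python's standard `difflib` module, and the synonym table, the section labels and the 0.8 cutoff are all assumptions chosen for illustration rather than values taken from Auto-CORPus or verified against IAO.

```python
import difflib

# Hypothetical synonym table: section label -> header strings seen for it.
# The labels are placeholders and have not been checked against IAO.
SECTION_SYNONYMS = {
    "abstract": ["abstract", "summary"],
    "methods section": ["methods", "materials and methods", "methodology"],
    "results section": ["results", "findings"],
    "discussion section": ["discussion", "conclusions"],
}


def classify_header(header, cutoff=0.8):
    """Return the section label whose synonym best matches `header`.

    Returns None when no synonym reaches the similarity cutoff.
    """
    normalized = header.strip().lower()
    # Invert the table so each synonym points back to its section label.
    lookup = {
        synonym: label
        for label, synonyms in SECTION_SYNONYMS.items()
        for synonym in synonyms
    }
    matches = difflib.get_close_matches(normalized, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None
```

For example, `classify_header("Materials and Methods")` matches a synonym exactly after lower-casing, while a header with no close synonym returns `None` and would remain unmapped.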

**Figure 6**
Extracts of the Auto-CORPus table JSON file generated to store metadata and content for an example table. **(A)** The parts of a table stored in table JSON. The section titles are underlined. The table shown is the PMC version (PMC4245044) of Table 1 from (15). **(B)** The title and caption table metadata stored in table JSON. **(C)** Each column heading in the table content is split between two rows, so the strings from both cells are concatenated with a pipe symbol in the table JSON. Headers that span multiple columns of sub-headers are replicated in each header cell and joined to the sub-header with the pipe symbol, as shown here. **(D)** The table content for the first row from the first section is shown in table JSON. Superscript characters are identified using HTML markup. **(E)** The footer table metadata stored in table JSON. The “autocorpus_tables.key” file describes the contents of the tables JSON file (https://github.com/omicsNLP/Auto-CORPus/blob/main/keyFiles/autocorpus_tables.key).
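
The header handling described in panel (C) can be sketched in Python under a few assumptions. The input representation (each header row as a list of `(text, colspan)` pairs), the function names and the example headings are hypothetical, and the sketch ignores cells that span several rows; it only illustrates replicating a spanning header across its columns and joining stacked header cells with a pipe symbol.

```python
def expand_row(cells):
    """Repeat each header cell's text once per column it spans."""
    expanded = []
    for text, colspan in cells:
        expanded.extend([text] * colspan)
    return expanded


def flatten_headers(header_rows, sep="|"):
    """Join stacked header cells column by column with `sep`.

    `header_rows` is a list of rows; each row is a list of (text, colspan)
    pairs. Empty cells are skipped so they do not leave a stray separator.
    """
    grid = [expand_row(row) for row in header_rows]
    if not grid:
        return []
    width = len(grid[0])
    if any(len(row) != width for row in grid):
        raise ValueError("header rows must span the same number of columns")
    flattened = []
    for column in zip(*grid):
        parts = [part for part in column if part]
        flattened.append(sep.join(parts))
    return flattened


# Made-up two-row header: "Concentration" spans the "Mean" and "SD" columns.
header_rows = [
    [("Sample", 1), ("Concentration", 2)],
    [("", 1), ("Mean", 1), ("SD", 1)],
]
flat = flatten_headers(header_rows)
```

With this input, the spanning header is repeated for both sub-headers, so each flattened column name carries its full context.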

References

1. Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR Med Inform. (2019) 7:e12239. doi: 10.2196/12239
2. Jackson RG, Patel R, Jayatilleke N, Kolliakou A, Ball M, Gorrell G, et al. Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project. BMJ Open. (2017) 7:e012012. doi: 10.1136/bmjopen-2016-012012
3. Erhardt RA, Schneider R, Blaschke C. Status of text-mining techniques applied to biomedical text. Drug Discov Today. (2006) 11:315–25. doi: 10.1016/j.drudis.2006.02.011
4. Wang LL, Cachola I, Bragg J, Yu-Yen Cheng E, Haupt C, Latzke M, et al. Improving the accessibility of scientific documents: current state, user needs, and a system solution to enhance scientific PDF accessibility for blind and low vision users. arXiv e-prints: arXiv:2105.00076 (2021). Available online at: https://arxiv.org/pdf/2105.00076.pdf
5. Comeau DC, Islamaj Dogan R, Ciccarese P, Cohen KB, Krallinger M, Leitner F, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database. (2013) 2013:bat064. doi: 10.1093/database/bat064
