PubMedPortable: A Framework for Supporting the Development of Text Mining Applications

PLoS One. 2016 Oct 5;11(10):e0163794. doi: 10.1371/journal.pone.0163794. eCollection 2016.

Authors

Kersten Döring¹, Björn A Grüning², Kiran K Telukunta², Philippe Thomas³, Stefan Günther¹

Affiliations

¹ Pharmaceutical Bioinformatics, Institute of Pharmaceutical Sciences, Albert-Ludwigs University, 79104 Freiburg, Germany.
² Bioinformatics, Institute of Computer Science, Albert-Ludwigs University, 79110 Freiburg, Germany.
³ Language Technology Lab, German Research Center for Artificial Intelligence, DFKI GmbH, 10559 Berlin, Germany.

PMID: 27706202
PMCID: PMC5051953
DOI: 10.1371/journal.pone.0163794

Abstract

Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user's system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. PubMedPortable workflow.**
1) Download XML files from PubMed. 2) Parse and upload data into a PostgreSQL relational database. 3) Build a Xapian full text index. 4) Develop text mining applications.
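
Steps 1 and 2 of this workflow can be illustrated with a minimal sketch. It is not PubMedPortable's own parser or schema: it assumes the standard PubMed XML element names (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`), uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table name `citation` and the sample record are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Toy stand-in for one downloaded PubMed XML file (the record content is made up).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Placeholder title</ArticleTitle>
        <Abstract><AbstractText>Placeholder abstract text.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_citations(xml_text):
    """Yield (pmid, title, abstract) tuples from a PubMed XML string."""
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        pmid = int(article.findtext("MedlineCitation/PMID"))
        title = article.findtext("MedlineCitation/Article/ArticleTitle", default="")
        # An abstract may consist of several AbstractText sections; join them.
        parts = [
            "".join(node.itertext())
            for node in article.iterfind("MedlineCitation/Article/Abstract/AbstractText")
        ]
        yield pmid, title, " ".join(parts)


def load_citations(conn, xml_text):
    """Create the hypothetical 'citation' table if needed and insert parsed records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation "
        "(pmid INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO citation VALUES (?, ?, ?)",
        parse_citations(xml_text),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite stand-in for PostgreSQL
load_citations(conn, SAMPLE_XML)
rows = conn.execute("SELECT pmid, title, abstract FROM citation").fetchall()
print(rows)
```

With a real PostgreSQL instance, the same parsing function could feed any DB-API driver; only the connection and the placeholder syntax would change.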

**Fig 2. General BioC workflow.**
This is the minimalistic approach from Comeau *et al*. [10], with an example of how to add MeSH terms from the PubMedPortable PostgreSQL database to PubMed titles and abstracts in BioC format.

**Fig 3. Excerpt of a BioC XML document.**
The document ID 100475 is a PubMed ID. PubTator annotations are shown with infon elements that contain the key *type* with the value *Disease* and the key *MEDIC* referring to a MeSH ID, such as *D010190* for the given disease *pancreatic carcinoma*. The PubMedPortable MeSH term annotations are shown with the annotation IDs *0_MeSH* and *1_MeSH* to make them distinguishable from the normally iterating PubTator annotation IDs. They were added after calling the PubTator web service.

**Fig 4. Read BioC elements.**
All BioC XML elements can be read with the BioC API. The script refers to the left part of the workflow shown in Fig 2. Iterating over the annotations shown in Fig 3 prints, for example, *Annotation ID: 0*, *Annotation Type: Disease*, *Annotation Text: pancreatic carcinoma*, and *Offset and term length: 77:20*.
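
The kind of iteration described for Figs 3 and 4 can be sketched without the BioC API, using only Python's standard XML parser. This is an illustration under assumptions, not the script from the figure: the excerpt below is reconstructed from the values named in the captions (document ID, annotation ID, infon keys, offset, and length), and the passage text is omitted.

```python
import xml.etree.ElementTree as ET

# Reconstructed excerpt: values come from the captions of Figs 3 and 4;
# the passage text itself is left out.
BIOC_EXCERPT = """<collection>
  <document>
    <id>100475</id>
    <passage>
      <offset>0</offset>
      <annotation id="0">
        <infon key="type">Disease</infon>
        <infon key="MEDIC">D010190</infon>
        <location offset="77" length="20"/>
        <text>pancreatic carcinoma</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(bioc_xml):
    """Return one dict per annotation with ID, type, text, and offset:length."""
    root = ET.fromstring(bioc_xml)
    records = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for annotation in document.iter("annotation"):
            # Collect all infon key/value pairs of this annotation.
            infons = {i.get("key"): i.text for i in annotation.findall("infon")}
            location = annotation.find("location")
            records.append({
                "document_id": doc_id,
                "annotation_id": annotation.get("id"),
                "type": infons.get("type"),
                "text": annotation.findtext("text"),
                "offset_length": f"{location.get('offset')}:{location.get('length')}",
            })
    return records


annotations = read_annotations(BIOC_EXCERPT)
for a in annotations:
    print("Annotation ID:", a["annotation_id"])
    print("Annotation Type:", a["type"])
    print("Annotation Text:", a["text"])
    print("Offset and term length:", a["offset_length"])
```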

**Fig 5. Documentation to generate a word cloud using PubMedPortable.**
Different tools and different data formats might be used for named-entity recognition. Tab-separated files (CSV) with PubMed ID, synonym, and identifier in each line are used to collect all abstracts in which a match for identifier-specific synonyms appeared.
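
A minimal sketch of how such tab-separated files might be aggregated: each line holds a PubMed ID, a synonym, and an identifier, and the abstracts are collected per identifier so that entities can be ranked by their number of abstracts. The sample rows, identifiers, and function names are hypothetical and do not come from the PubMedPortable documentation.

```python
import csv
import io
from collections import defaultdict

# Made-up example rows: PubMed ID <TAB> synonym <TAB> identifier
TSV_DATA = (
    "1001\tK-ras\tGENE_A\n"
    "1001\tKRAS\tGENE_A\n"
    "1002\tKRAS\tGENE_A\n"
    "1002\tgemcitabine\tCHEM_B\n"
)


def abstracts_per_identifier(handle):
    """Map each identifier to the set of PubMed IDs in which a synonym matched."""
    mapping = defaultdict(set)
    for pmid, _synonym, identifier in csv.reader(handle, delimiter="\t"):
        # A set ensures an abstract counts once, even with several synonym hits.
        mapping[identifier].add(pmid)
    return mapping


def top_entities(mapping, n):
    """Return the n identifiers with the most abstracts as (identifier, count)."""
    counts = [(identifier, len(pmids)) for identifier, pmids in mapping.items()]
    # Sort by descending count, then by identifier for a stable order.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:n]


mapping = abstracts_per_identifier(io.StringIO(TSV_DATA))
ranking = top_entities(mapping, 150)
print(ranking)
```

The resulting counts could then serve as the weights for a word cloud.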

**Fig 6. Genes, proteins, chemicals, and diseases related to pancreatic cancer.**
The 150 most frequently appearing entities in terms of their number of abstracts were identified with DNorm [28], GeneTUKit [27], and PubTator [16]. Fig 5 shows the steps to generate this word cloud.

**Fig 7. Timelines for the publications of the genes *KRAS*, *BRCA2*, and *CDKN2A* until 2014.**
The PubMed IDs for these three genes were extracted from the list of entities resulting from step 4 in Fig 5. The publication years were selected from the PubMedPortable database.
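
Selecting publication years for a list of PubMed IDs could look roughly like the following. This is a sketch only: SQLite stands in for the PostgreSQL database, the table `citation` with the columns `pmid` and `pub_year` is a hypothetical schema, and the rows are toy data rather than results from the paper.

```python
import sqlite3

# In-memory SQLite stand-in with a hypothetical table and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citation (pmid INTEGER PRIMARY KEY, pub_year INTEGER)")
conn.executemany(
    "INSERT INTO citation VALUES (?, ?)",
    [(1, 2012), (2, 2013), (3, 2013), (4, 2014)],
)


def publications_per_year(conn, pmids):
    """Return (year, number of publications) pairs for the given PubMed IDs."""
    pmids = list(pmids)
    if not pmids:
        return []
    # One '?' placeholder per ID keeps the query parameterised.
    placeholders = ",".join("?" for _ in pmids)
    query = (
        "SELECT pub_year, COUNT(*) FROM citation "
        f"WHERE pmid IN ({placeholders}) "
        "GROUP BY pub_year ORDER BY pub_year"
    )
    return conn.execute(query, pmids).fetchall()


gene_pmids = [1, 2, 3]  # e.g. the IDs extracted for one gene
timeline = publications_per_year(conn, gene_pmids)
print(timeline)
```

Running the function once per gene yields one series per timeline.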

**Fig 8. Boolean query result.**
The HTML page shows a rank in the first column with a relative match score, scaled to 100. The NEAR condition was used to allow up to four other words between the drug erlotinib and the disease term pancreatic cancer without fixed word order.
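
The proximity condition can be illustrated in plain Python. The sketch below is not Xapian's NEAR implementation and does not reproduce its scoring; it only checks whether a term and a phrase occur with at most `max_gap` other words between them, in either order. The function names and example sentences are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near_match(text, term, phrase, max_gap=4):
    """True if term and phrase occur with at most max_gap words between them."""
    tokens = tokenize(text)
    term_token = term.lower()
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    term_positions = [i for i, tok in enumerate(tokens) if tok == term_token]
    phrase_starts = [
        i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens
    ]
    for t in term_positions:
        for p in phrase_starts:
            if t < p:
                gap = p - t - 1          # words between term and phrase start
            elif t >= p + n:
                gap = t - (p + n)        # words between phrase end and term
            else:
                continue                 # term lies inside the phrase itself
            if gap <= max_gap:
                return True
    return False


close = near_match("Erlotinib in advanced pancreatic cancer", "erlotinib", "pancreatic cancer")
reverse = near_match("Pancreatic cancer patients received erlotinib", "erlotinib", "pancreatic cancer")
far = near_match(
    "Erlotinib was evaluated in a large cohort of patients with pancreatic cancer",
    "erlotinib",
    "pancreatic cancer",
)
print(close, reverse, far)
```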

Cited by

Sernadela P, Oliveira JL. A semantic-based workflow for biomedical literature annotation. Database (Oxford). 2017 Jan 1;2017:bax088. doi: 10.1093/database/bax088. PMID: 29220478
Döring K, Qaseem A, Becer M, Li J, Mishra P, Gao M, Kirchner P, Sauter F, Telukunta KK, Moumbock AFA, Thomas P, Günther S. Automated recognition of functional compound-protein relationships in literature. PLoS One. 2020 Mar 3;15(3):e0220925. doi: 10.1371/journal.pone.0220925. eCollection 2020. PMID: 32126064

References

1. Khare R, Leaman R, Lu Z. Accessing Biomedical Literature in the Current Information Landscape. In: Biomedical Literature Mining. vol. 1159. New York, NY: Springer New York; 2014. p. 11–31. doi: 10.1007/978-1-4939-0709-0_2
2. Tikk D, Solt I, Thomas P, Leser U. A detailed error analysis of 13 kernel methods for protein–protein interaction extraction. BMC Bioinformatics. 2013;14(1):12. doi: 10.1186/1471-2105-14-12
3. Tari L, Anwar S, Liang S, Cai J, Baral C. Discovering drug-drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics. 2010 September;26(18):i547–i553. doi: 10.1093/bioinformatics/btq382
4. Senger C, Grüning BA, Erxleben A, Döring K, Patel H, Flemming S, et al. Mining and evaluation of molecular relationships in literature. Bioinformatics. 2012 March;28(5):709–714. doi: 10.1093/bioinformatics/bts026
5. Kuhn M, Szklarczyk D, Pletscher-Frankild S, Blicher TH, von Mering C, Jensen LJ, et al. STITCH 4: integration of protein–chemical interactions with user data. Nucleic Acids Research. 2014 January;42(D1):D401–D407. doi: 10.1093/nar/gkt1207
