PeerJ. 2021 Mar 11:9:e11071.
doi: 10.7717/peerj.11071. eCollection 2021.

pmparser and PMDB: resources for large-scale, open studies of the biomedical literature

Joshua L Schoenbachler et al. PeerJ.

Abstract

PubMed is an invaluable resource for the biomedical community. Although PubMed is freely available, the existing API is not designed for large-scale analyses and the XML structure of the underlying data is inconvenient for complex queries. We developed an R package called pmparser to convert the data in PubMed to a relational database. Our implementation of the database, called PMDB, currently contains data on over 31 million PubMed Identifiers (PMIDs) and is updated regularly. Together, pmparser and PMDB can enable large-scale, reproducible, and transparent analyses of the biomedical literature. pmparser is licensed under GPL-2 and available at https://pmparser.hugheylab.org. PMDB is available in both PostgreSQL (DOI 10.5281/zenodo.4008109) and Google BigQuery (https://console.cloud.google.com/bigquery?project=pmdb-bq&d=pmdb).
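To illustrate what converting XML records into relational rows can look like, here is a minimal sketch in Python. It is not pmparser's implementation (pmparser is an R package). The XML is a hand-written, simplified fragment that uses PubMed-style element names, and the `author` table layout with `last_name` and `fore_name` columns is an assumption made for this example; only the `pmid` and `author_pos` field names appear in this record.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hand-written toy record; real PubMed XML carries many more elements.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_authors(xml_text):
    """Flatten nested author elements into (pmid, author_pos, last, fore) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("PubmedArticle"):
        pmid = int(article.findtext("MedlineCitation/PMID"))
        authors = article.findall("MedlineCitation/Article/AuthorList/Author")
        for pos, author in enumerate(authors, start=1):
            rows.append(
                (pmid, pos, author.findtext("LastName"), author.findtext("ForeName"))
            )
    return rows


def load_authors(rows):
    """Load the rows into an in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE author "
        "(pmid INTEGER, author_pos INTEGER, last_name TEXT, fore_name TEXT)"
    )
    conn.executemany("INSERT INTO author VALUES (?, ?, ?, ?)", rows)
    return conn


conn = load_authors(parse_authors(SAMPLE_XML))
n_authors = conn.execute("SELECT COUNT(*) FROM author WHERE pmid = 1").fetchone()[0]
```

Once the records are in tables, a question such as "how many authors does this PMID have" becomes a one-line SQL query rather than a traversal of nested XML.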

Keywords: Database; Parsing; Publishing; PubMed.

Conflict of interest statement

Jacob J. Hughey is an Academic Editor for PeerJ.

Figures

Figure 1. Using PMDB to quantify the number of authors per publication between 1920 and 2020, grouped by year.
The database query involved joining the pub_history and author tables on the pmid field. The MEDLINE XML documentation states that for PMIDs created between 1984 and 1995, at most 10 authors were entered, and for PMIDs between 1996 and 1999, at most 25 authors were entered. These limits do not affect the median or interquartile range, but may introduce inaccuracies to the mean.
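The join described for Figure 1 might be sketched as follows. This is an illustration with Python's built-in `sqlite3` and toy rows, not the authors' query. Only the table names `pub_history` and `author` and the `pmid` field come from the caption; the `pub_date` column, the choice of the earliest date per PMID, and all data values are assumptions.

```python
import sqlite3
from statistics import median


def build_toy_db():
    """Create an in-memory database with a hypothetical, minimal schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pub_history (pmid INTEGER, pub_date TEXT)")
    conn.execute("CREATE TABLE author (pmid INTEGER, author_pos INTEGER)")
    conn.executemany(
        "INSERT INTO pub_history VALUES (?, ?)",
        [(1, "2019-05-01"), (2, "2019-08-15"), (3, "2020-01-10")],
    )
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (2, 4), (3, 1)],
    )
    return conn


def median_authors_by_year(conn):
    """Join pub_history and author on pmid, then summarize per year."""
    rows = conn.execute(
        """
        WITH first_pub AS (
            -- Assumption: use the earliest date recorded for each PMID.
            SELECT pmid, MIN(pub_date) AS pub_date
            FROM pub_history
            GROUP BY pmid
        )
        SELECT CAST(strftime('%Y', f.pub_date) AS INTEGER) AS year,
               f.pmid,
               COUNT(*) AS n_authors
        FROM first_pub AS f
        JOIN author AS a ON a.pmid = f.pmid
        GROUP BY f.pmid, f.pub_date
        """
    ).fetchall()

    counts_by_year = {}
    for year, _pmid, n_authors in rows:
        counts_by_year.setdefault(year, []).append(n_authors)
    # The median is unaffected by the per-PMID author caps noted in the caption
    # as long as the caps lie above the median itself.
    return {year: median(c) for year, c in sorted(counts_by_year.items())}


result = median_authors_by_year(build_toy_db())
```

Computing the median in Python keeps the SQL portable, since SQLite has no built-in median aggregate; in PostgreSQL or BigQuery the same summary could plausibly be done inside the query.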
Figure 2. Using PMDB to quantify the percentage of author names with an ORCID identifier from January 2013 to August 2020, grouped by month.
The database query involved joining the pub_history, author, and author_identifier tables on the pmid and author_pos fields. ORCID identifiers became available in October 2012.
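The three-table join described for Figure 2 could look roughly like the sketch below, again using Python's `sqlite3` with toy rows rather than the authors' actual query. The table names and the `pmid` and `author_pos` join fields come from the caption; the `pub_date`, `source`, and `identifier` columns, the `'ORCID'` label, and every data value are assumptions for illustration.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical, minimal schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pub_history (pmid INTEGER, pub_date TEXT)")
    conn.execute("CREATE TABLE author (pmid INTEGER, author_pos INTEGER)")
    conn.execute(
        "CREATE TABLE author_identifier "
        "(pmid INTEGER, author_pos INTEGER, source TEXT, identifier TEXT)"
    )
    conn.executemany(
        "INSERT INTO pub_history VALUES (?, ?)",
        [(1, "2020-01-05"), (2, "2020-01-20"), (3, "2020-02-11")],
    )
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)],
    )
    conn.executemany(
        "INSERT INTO author_identifier VALUES (?, ?, ?, ?)",
        [
            (1, 1, "ORCID", "placeholder-a"),
            (2, 2, "ORCID", "placeholder-b"),
            (3, 1, "OTHER", "placeholder-c"),  # not an ORCID, so not counted
        ],
    )
    return conn


def pct_orcid_by_month(conn):
    """Percentage of author names with an ORCID identifier, per month."""
    return conn.execute(
        """
        WITH first_pub AS (
            SELECT pmid, MIN(pub_date) AS pub_date
            FROM pub_history
            GROUP BY pmid
        ),
        orcid AS (
            SELECT DISTINCT pmid, author_pos
            FROM author_identifier
            WHERE source = 'ORCID'
        )
        SELECT strftime('%Y-%m', f.pub_date) AS month,
               100.0 * COUNT(o.pmid) / COUNT(*) AS pct_orcid
        FROM first_pub AS f
        JOIN author AS a ON a.pmid = f.pmid
        -- LEFT JOIN keeps authors without an identifier in the denominator.
        LEFT JOIN orcid AS o
               ON o.pmid = a.pmid AND o.author_pos = a.author_pos
        GROUP BY month
        ORDER BY month
        """
    ).fetchall()


result = pct_orcid_by_month(build_toy_db())
```

The `LEFT JOIN` matters here: an inner join on `author_identifier` would drop authors without any identifier and inflate the percentage.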
