PLoS One. 2020 Apr 22;15(4):e0230416.
doi: 10.1371/journal.pone.0230416. eCollection 2020.

The citation advantage of linking publications to research data

Giovanni Colavizza et al. PLoS One. 2020.

Abstract

Efforts to make research results open and reproducible are increasingly reflected in journal policies that encourage or mandate authors to provide data availability statements (DAS). As a consequence, there has been a strong uptake of data availability statements in recent literature. Nevertheless, it is still unclear what proportion of these statements actually contain well-formed links to data, for example via a URL or permanent identifier, and whether there is added value in providing such links. We consider 531,889 journal articles published by PLOS and BMC, develop an automatic system for labelling their data availability statements according to four categories based on their content and the type of data availability they display, and finally analyze the citation advantage of different statement categories via regression. We find that, following mandated publisher policies, data availability statements have become very common. In 2018, 93.7% of 21,793 PLOS articles and 88.2% of 31,956 BMC articles had data availability statements. Data availability statements that contain a link to data in a repository, rather than stating that data are available on request or included as supporting information files, are a fraction of the total. In 2017 and 2018, 20.8% of PLOS publications and 12.2% of BMC publications provided a DAS containing a link to data in a repository. We also find an association between articles that include statements linking to data in a repository and up to 25.36% (± 1.07%) higher citation impact on average, using a citation prediction model. We discuss the potential implications of these results for authors (researchers) and journal publishers who make the effort of sharing their data in repositories. All our data and code are made available in order to reproduce and extend our results.

Conflict of interest statement

One of the authors (IH) is, at the time of publication, employed by PLOS, publisher of PLOS ONE. IH was employed by Springer Nature, publisher of the BMC journals, at the time of planning and conducting the research and writing the original manuscript. This does not alter our adherence to PLOS ONE policies on sharing data and materials. There are no patents, products in development or marketed products associated with this research to declare. All other authors have declared that no other competing interests exist.

Figures

Fig 1. Data extraction and processing steps.
We first downloaded the PubMed open access collection (1) and created a database with all articles with a known identifier and which contained at least one reference (2; N = 1,969,175). Next we identified and disambiguated authors of these papers (3; S = 4,253,172) and calculated citations for each author and each publication from within the collection (4). We used these citation counts to calculate a within-collection H-index for each author. Our analysis only focuses on PLOS and BMC publications as these publishers introduced mandated DAS, so we filtered the database for these articles and extracted DAS from each publication (5). We annotated a training dataset by labelling each of these statements into one of four categories (6) and used those labels to train a natural language processing classifier (7). Using this classifier we then categorised the remaining DAS in the database (8). Finally, we exported this categorised dataset of M = 531,889 publications to a CSV file (9) and archived it (see Data and code availability section below).
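The within-collection H-index of step (4) can be illustrated with a minimal sketch. It uses the standard H-index definition (the largest h such that an author has h papers with at least h citations each) and counts only citations between papers inside the collection. The function names, the input shapes and the paper identifiers below are hypothetical choices made for illustration; they are not taken from the authors' released code.

```python
from collections import Counter


def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def within_collection_h_index(author_papers, citation_pairs):
    """Compute an H-index per author from in-collection citations only.

    author_papers: dict mapping author id -> list of paper ids (assumed shape)
    citation_pairs: iterable of (citing_id, cited_id) pairs, both of which
        belong to the collection (assumed shape)
    """
    # Count how often each paper is cited from within the collection.
    cited_counts = Counter(cited for _citing, cited in citation_pairs)
    return {
        author: h_index([cited_counts.get(paper, 0) for paper in papers])
        for author, papers in author_papers.items()
    }


# Toy example with hypothetical identifiers.
authors = {"a": ["p1", "p2", "p3"], "b": ["p4"]}
pairs = [("p2", "p1"), ("p3", "p1"), ("p4", "p2"), ("p1", "p2"), ("p1", "p3")]
scores = within_collection_h_index(authors, pairs)
```

Because citations from outside the collection are ignored, a score computed this way is at most the author's H-index over all literature and is usually lower.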
Fig 2. Data availability statements over time.
Each histogram shows the number of publications from a specific subset of the dataset, classified into four categories: No DAS (0), Category 1 (data available on request), Category 2 (data contained within the article and supplementary materials), and Category 3 (a link to archived data in a public repository). The vertical solid line shows the date on which the publisher introduced a mandated DAS policy. A dashed line indicates the date on which an encouraged policy was introduced. The groups of articles are as follows. A: all BMC articles, B: all PLOS articles, C: all BMC Series articles, D: PLOS ONE articles, E: PLOS articles not published in PLOS ONE, F: articles from the BMC Genomics journal (selected to illustrate a journal that had high uptake of an encouraged policy), G: articles from the Trials journal (published by BMC, selected to illustrate a journal that has a very high percentage of data that can only be made available by request to the authors), H: articles from the Parasites and Vectors journal (selected to illustrate a journal that has an even distribution of the three DAS categories). Articles are binned by publication year.
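To make the four categories concrete, the following sketch assigns a category to a statement using simple keyword rules. It is a deliberately simplified stand-in and not the trained natural language processing classifier described in Fig 1. The regular expressions, the precedence order (repository link, then in-article data, then on request) and the choice to return None for unmatched statements are all assumptions made for illustration.

```python
import re

# Hypothetical keyword patterns; a real classifier would learn these from labels.
REPOSITORY = re.compile(r"https?://|\bdoi\b|\baccession\b", re.IGNORECASE)
IN_ARTICLE = re.compile(
    r"within the (paper|article|manuscript)|supporting information"
    r"|supplementary|additional file",
    re.IGNORECASE,
)
ON_REQUEST = re.compile(
    r"\b(on|upon) (reasonable )?request|from the (corresponding )?author",
    re.IGNORECASE,
)


def categorise_das(statement):
    """Return 0-3 following the categories in Fig 2, or None if no rule matches."""
    if statement is None or not statement.strip():
        return 0  # No DAS
    if REPOSITORY.search(statement):
        return 3  # link to archived data in a public repository
    if IN_ARTICLE.search(statement):
        return 2  # data within the article or supplementary materials
    if ON_REQUEST.search(statement):
        return 1  # data available on request
    return None  # assumption: leave unmatched statements unlabelled


# Example statements; the URL is a placeholder, not a real dataset.
examples = [
    "",
    "Data are available from the corresponding author on reasonable request.",
    "All relevant data are within the paper and its Supporting Information files.",
    "The dataset is archived at https://example.org/dataset.",
    "Not applicable.",
]
labels = [categorise_das(text) for text in examples]
```

Rules of this kind are brittle on mixed statements, for example one that mentions both a repository link and availability on request, which is one reason a classifier trained on annotated statements is a better fit for a corpus of this size.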
