Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Apr 21:8:e8871.
doi: 10.7717/peerj.8871. eCollection 2020.

Data-driven classification of the certainty of scholarly assertions

Affiliations

Data-driven classification of the certainty of scholarly assertions

Mario Prieto et al. PeerJ. .

Abstract

The grammatical structures scholars use to express their assertions are intended to convey various degrees of certainty or speculation. Prior studies have suggested a variety of categorization systems for scholarly certainty; however, these have not been objectively tested for their validity, particularly with respect to representing the interpretation by the reader, rather than the intention of the author. In this study, we use a series of questionnaires to determine how researchers classify various scholarly assertions, using three distinct certainty classification systems. We find that there are three distinct categories of certainty along a spectrum from high to low. We show that these categories can be detected in an automated manner, using a machine learning model, with a cross-validation accuracy of 89.2% relative to an author-annotated corpus, and 82.2% accuracy against a publicly-annotated corpus. This finding provides an opportunity for contextual metadata related to certainty to be captured as a part of text-mining pipelines, which currently miss these subtle linguistic cues. We provide an exemplar machine-accessible representation-a Nanopublication-where certainty category is embedded as metadata in a formal, ontology-based manner within text-mined scholarly assertions.

Keywords: Certainty; FAIR Data; Machine learning; Scholarly communication; Text mining.

PubMed Disclaimer

Conflict of interest statement

Helena Deus is employed by Elsevier Inc. Anita de Waard is Vice President, Research Data Collaborations at Elsevier. Erik Schultes is employed by the GO FAIR International Support and Coordination Office.

Figures

Figure 1
Figure 1. How a claim becomes a fact.
These statements represent a series of scholarly assertions about the same biological phenomenon, revealing that the core assertion transforms from a hedging statement into statements resembling fact through several steps, but without additional evidence (de Waard, 2012).
Figure 2
Figure 2. Example of the Survey 1 questionnaire interface.
A scholarly assertion is highlighted in blue, in its original context. Participants are asked to characterize the blue assertion, using one of four levels of certainty (High, Medium High, Medium Low or Low).
Figure 3
Figure 3. Spearman Rank Correlation and hierarchically-clustered heatmap on ranked transformed values comparing the statements assigned to the Certainty Categories among all three questionnaires.
The clustering tree and heatmap are based on participants’ responses adjusted by rank transformation from questionnaires S1, S2 and S3. Certainty categories clustered hierarchically. Boxes shows color legend and coefficients based on Spearman’s rank-order correlation of the certainty categories.
Figure 4
Figure 4. Majority rule output for deciding optimal number of clusters (k) in the three surveys.
(A) Majority rule indicates three clusters for Survey 1. (B) Majority rule indicates two clusters for Survey 2, though there is notable support for three clusters. (C) Majority rule indicates three clusters for Survey 3, with notable support for two clusters.
Figure 5
Figure 5. Principal component analysis of questionnaire responses in the three surveys.
Bi-plot of certainty level distribution over results from k-means clustering (colors) for: Survey 1 (A), Survey 2 with three clusters (B), Survey 2 with two clusters (C) and Survey 3 (D). Each dot represents a statement. Red lines are the eigenvectors for each component.
Figure 6
Figure 6. Automated classification of scholarly assertions related to the accumulation of beta-APP protein in muscle fibres, color coded as green (Category A: highest certainty), orange (Category B: medium certainty) and red (Category C: lowest certainty).
(A and B) Two citation chains showing that the degree of certainty expressed in the most recent statement is higher than that in the cited text. (C) A selection of statements identified by Greenberg (2009), as being potentially indicative of ‘citation distortion’. In this panel, there is a general trend to higher certainty over time, with the exception of an early high-certainty statement by Mastaglia in 2003 (second row from the bottom).
Figure 7
Figure 7. An exemplar prototype NanoPublication including certainty annotations.
The figure shows how certainty classifications could be used as additional, and important metadata when added to text-mining pipelines. A NanoPublication is a machine-readable representation of a scholarly assertion, carrying with it all of its provenance. In this exemplar (hypothetical) NanoPublication for statement #29 in this study, the concept being asserted (that microRNA mir-155 has the function of a Tumor Suppressor) is captured using ontologically-based concepts in the “assertion” block of the NanoPublication (red text), together with the proposed annotation of that statement’s certainty category (blue text) being ORCA-X Category A. This could be used, for example, to filter assertions based on the degree of certainty they express. The final block, pubinfo, contains authorship, license, and citation information for the NanoPublication itself, expressing the terms of usage of this metadata, and who to cite (green text). This entire structure can be interpreted by automated agents, and fully complies with the FAIR Data Principles.

References

    1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow IJ, Harp A, Irving G, Isard M, Jia Y, Józefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray DG, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker PA, Vanhoucke V, Vasudevan V, Viégas FB, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: large-scale machine learning on heterogeneous distributed systems. CoRR. 2016 abs/1603.04467.
    1. Agami R, Bernards R. Distinct initiation and maintenance mechanisms cooperate to induce G1 cell cycle arrest in response to DNA damage. Cell. 2000;102:55–66. doi: 10.1016/S0092-8674(00)00010-6. - DOI - PubMed
    1. Baxter MJ. Standardization and transformation in principal component analysis, with applications to archaeometry. Applied Statistics; 1995.
    1. Baziotis C, Pelekis N, Doulkeridis C. Datastories at semeval-2017 task 4: deep lstm with attention for message-level and topic-based sentiment analysis. Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017); 2017. pp. 747–754.
    1. Campbell J, Narayanan A, Burford B, Greco M. Validation of a multi-source feedback tool for use in general practice. Education for Primary Care. 2010;21:165–179. doi: 10.1080/14739879.2010.11493902. - DOI - PubMed

LinkOut - more resources