BMC Bioinformatics. 2019 Apr 29;20(1):216.

doi: 10.1186/s12859-019-2801-x.

Automated assessment of biological database assertions using the scientific literature

Mohamed Reda Bouadjenek¹, Justin Zobel², Karin Verspoor²

Affiliations

¹ Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, M5S 3G8, Canada. mrb@mie.utoronto.ca.
² School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia.

PMID: 31035936
PMCID: PMC6489365
DOI: 10.1186/s12859-019-2801-x

Mohamed Reda Bouadjenek et al. BMC Bioinformatics. 2019.

Abstract

Background: Large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including the published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. In this paper we explore an automated method for assessing the consistency of biological assertions, intended to assist biocurators, which we call BARC (Biocuration tool for Assessment of Relation Consistency). In this method, a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm, SaBRA, to retrieve pertinent literature and apply a classifier to estimate the likelihood that the relation (assertion) is correct.

Results: Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, BARC substantially outperforms the best baselines, improving F-measure by 3.5% on gene-disease relations and by 13% on protein-protein interactions. An additional feature analysis showed that all feature types are informative, as are all fields of the documents.

Conclusions: BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.

Keywords: Biological Databases; Data Analysis; Data Cleansing; Data Quality.
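
The F-measure reported in the Results combines precision and recall into a single score. The sketch below assumes the balanced variant (F1, the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name and the confusion counts are hypothetical and chosen only for illustration.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from true positives, false positives and false negatives.

    beta=1.0 gives the balanced F1 score (an assumption; the abstract
    does not say which variant was used).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 assertions correctly flagged, 2 wrongly flagged, 2 missed.
score = f_measure(tp=8, fp=2, fn=2)
print(round(score, 3))
```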

Conflict of interest statement

Ethics approval and consent to participate

Not Applicable.

Consent for publication

Not Applicable.

Competing interests

KV is a member of the Editorial board of BMC Bioinformatics.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Growth of the number of sequences in UniProt databases. The green and pink lines show the growth of TrEMBL and Swiss-Prot entries, respectively, in UniProtKB from January 2012 to January 2019. The sharp drop in TrEMBL entries corresponds to a proteome redundancy minimization procedure implemented in March 2015 [5]. a Growth of TrEMBL. b Growth of Swiss-Prot

**Fig. 2**
Distribution of co-mention frequencies for the *correct* and *incorrect* relations described in the “Experimental data” section. It is apparent that even when two entities are not known to have a valid relationship (an *incorrect* relation), these entities may often be mentioned together in a text (co-mentioned)
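
To make the notion of a co-mention concrete, the following minimal sketch counts how many documents mention both entities of a relation. It is not the authors' implementation: the whole-word, case-insensitive matching rule, the function names, and the toy corpus are all assumptions made for illustration.

```python
import re
from typing import Iterable


def mentions(text: str, entity: str) -> bool:
    """True if `entity` occurs in `text` as a whole word (case-insensitive)."""
    pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def co_mention_count(docs: Iterable[str], entity_a: str, entity_b: str) -> int:
    """Number of documents that mention both entities at least once."""
    return sum(1 for d in docs if mentions(d, entity_a) and mentions(d, entity_b))


# Hypothetical toy corpus, invented for illustration only.
docs = [
    "Mutations in CFTR cause cystic fibrosis.",
    "CFTR was studied alongside asthma in this cohort.",
    "Asthma prevalence is rising.",
]

print(co_mention_count(docs, "CFTR", "cystic fibrosis"))
print(co_mention_count(docs, "CFTR", "asthma"))
```

In this toy corpus both pairs are co-mentioned in one document each, even though only the first pair reflects a stated causal relation, which is the effect the caption describes: co-mention frequency alone does not separate correct from incorrect relations.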

**Fig. 3**
Architecture overview of BARC

**Fig. 4**
Toy example for building the context similarity matrix of the relational statement (*CFTR*, *causes*, CF) from the top 5 documents returned by SaBRA. Similarities are computed between the contexts of each occurrence of the two entities in the top documents. Aggregation values are then computed based on the obtained matrix to construct a feature vector
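
The construction described in the Fig. 4 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's method: contexts are taken as a fixed token window around each occurrence, similarity is cosine over bag-of-words counts, and the aggregates are max, min and mean. The window size, the tokenization, the function names, and the two example sentences are all hypothetical.

```python
import math
from collections import Counter
from typing import List


def contexts(tokens: List[str], entity: str, window: int = 3) -> List[Counter]:
    """Bag-of-words context (up to `window` tokens each side) per occurrence
    of a single-token entity."""
    out = []
    for i, tok in enumerate(tokens):
        if tok == entity:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            out.append(Counter(left + right))
    return out


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_matrix(docs: List[str], e1: str, e2: str, window: int = 3) -> List[List[float]]:
    """Rows: occurrences of e1; columns: occurrences of e2, across all documents."""
    rows = [c for d in docs for c in contexts(d.lower().split(), e1, window)]
    cols = [c for d in docs for c in contexts(d.lower().split(), e2, window)]
    return [[cosine(r, c) for c in cols] for r in rows]


def aggregate(matrix: List[List[float]]) -> List[float]:
    """Feature vector [max, min, mean] over all matrix cells (assumed aggregates)."""
    values = [v for row in matrix for v in row]
    if not values:
        return [0.0, 0.0, 0.0]
    return [max(values), min(values), sum(values) / len(values)]


# Hypothetical stand-ins for top-ranked documents; entities are lowercased tokens.
top_docs = [
    "CFTR mutations cause CF in patients",
    "Loss of CFTR function leads to CF",
]
matrix = similarity_matrix(top_docs, "cftr", "cf")
features = aggregate(matrix)
print(features)
```

Each occurrence of the first entity is compared with each occurrence of the second, so two documents with one occurrence of each entity yield a 2×2 matrix, which is then collapsed into a fixed-length feature vector regardless of how many occurrences were found.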

**Fig. 5**
Comparison of *SaBRA* with TF-IDF and BM25 scoring functions using gene-disease relations. a Precision for correct statements. b Recall for correct statements. c Classification Accuracy. d ROC AUC. e ROC K=1. f ROC K=2. g ROC K=3. h ROC K=5. i ROC K=10. j ROC K=15. k ROC K=20. l ROC K=25. m ROC K=30

**Fig. 6**
Comparison of *SaBRA* with TF-IDF and BM25 scoring functions using protein-protein interactions. a Precision for correct statements. b Recall for correct statements. c Classification Accuracy. d ROC AUC. e ROC K=1. f ROC K=2. g ROC K=3. h ROC K=5. i ROC K=10. j ROC K=15. k ROC K=20. l ROC K=25. m ROC K=30

**Fig. 7**
Performance comparison of *SaBRA* on different relations with different document support values for k=30. a Performance comparison for different document support values on gene-disease relations. b Distribution of gene-disease relations. c Performance comparison for different document support values on protein-protein interactions. d Distribution of protein-protein interactions

**Fig. 8**
Performance comparison of BARC for k=30. DT: Decision Tree-based classification. a Performance comparison on gene–disease relations. b Performance comparison on protein–protein interactions

**Fig. 9**
Feature analysis using MI. The higher the density color, the higher the MI value. a Gene-Disease relations. b Protein-Protein interactions

**Fig. 10**
Feature ablation analysis (k=30). a Performance on gene–disease relations. b Performance on protein–protein interactions

References

1. Baxevanis AD, Bateman A. The importance of biological databases in biological discovery. Curr Protocol Bioinforma. 2015;50(1):1.
2. Bateman A. Curators of the world unite: the international society of biocuration. Bioinformatics. 2010;26(8):991.
3. NCBI Resource Coordinators. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2017;45(D1):D12–7.
4. Poux S, Magrane M, Arighi CN, Bridge A, O’Donovan C, Laiho K, The UniProt Consortium. Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. Database. 2014;2014:bau016.
5. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017;45(D1):D158–69.

Grants and funding

DP150101550/Australian Research Council
