BMC Bioinformatics. 2019 Apr 29;20(1):216.
doi: 10.1186/s12859-019-2801-x.

Automated assessment of biological database assertions using the scientific literature

Mohamed Reda Bouadjenek et al. BMC Bioinformatics. 2019.

Abstract

Background: Large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. In this paper we explore an automated method for assessing the consistency of biological assertions, intended to assist biocurators, which we call BARC (Biocuration tool for Assessment of Relation Consistency). In this method, a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm, SaBRA, to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct.

Results: Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, BARC substantially outperforms the best baselines, improving F-measure by 3.5% on gene-disease relations and by 13% on protein-protein interactions. A feature analysis additionally showed that all feature types are informative, as are all fields of the documents.

Conclusions: BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.

Keywords: Biological Databases; Data Analysis; Data Cleansing; Data Quality.
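
The abstract describes an assertion as a relation between two objects. As a rough illustration only, the sketch below models such a relation and counts how many documents mention both entities. It is not BARC or SaBRA: the Assertion class, the co_mention_count function, the substring matching, and the toy corpus are all assumptions made for this example. The triple (CFTR, causes, cystic fibrosis) follows the toy statement used in the caption of Fig. 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """A database assertion modelled as a relation between two entities (hypothetical type)."""
    subject: str    # e.g. a gene symbol
    predicate: str  # e.g. "causes"
    obj: str        # e.g. a disease name


def co_mention_count(assertion, documents):
    """Count documents mentioning both entities (case-insensitive substring match)."""
    s, o = assertion.subject.lower(), assertion.obj.lower()
    return sum(1 for text in documents if s in text.lower() and o in text.lower())


# Hypothetical toy corpus, not taken from the paper.
docs = [
    "Mutations in CFTR cause cystic fibrosis.",
    "CFTR encodes a chloride channel.",
    "Cystic fibrosis affects the lungs; CFTR is the gene involved.",
]
a = Assertion("CFTR", "causes", "cystic fibrosis")
print(co_mention_count(a, docs))  # prints 2
```

As the caption of Fig. 2 notes, entities without a valid relationship are also often co-mentioned, so a count like this is at most a starting signal and not a correctness estimate.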

Conflict of interest statement

Ethics approval and consent to participate

Not Applicable.

Consent for publication

Not Applicable.

Competing interests

KV is a member of the Editorial board of BMC Bioinformatics.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Growth of the number of sequences in UniProt databases. The green and pink lines show the growth of TrEMBL and Swiss-Prot entries in UniProtKB, respectively, from January 2012 to January 2019. The sharp drop in TrEMBL entries corresponds to a proteome redundancy minimization procedure implemented in March 2015 [5]. a Growth of TrEMBL. b Growth of Swiss-Prot
Fig. 2
Distribution of co-mention frequencies for the correct and incorrect relations described in the “Experimental data” section. It is apparent that even when two entities are not known to have a valid relationship (an incorrect relation), they may often be mentioned together in a text (co-mentioned)
Fig. 3
Architecture overview of BARC
Fig. 4
Toy example for building the context similarity matrix of the relational statement (CFTR, causes, CF) from the top 5 documents returned by SaBRA. Similarities are computed between the contexts of each occurrence of the two entities in the top documents. Aggregation values are then computed based on the obtained matrix to construct a feature vector
Fig. 5
Comparison of SaBRA with TF-IDF and BM25 scoring functions using gene-disease relations. a Precision for correct statements. b Recall for correct statements. c Classification Accuracy. d ROC AUC. e ROC K = 1. f ROC K = 2. g ROC K = 3. h ROC K = 5. i ROC K = 10. j ROC K = 15. k ROC K = 20. l ROC K = 25. m ROC K = 30
Fig. 6
Comparison of SaBRA with TF-IDF and BM25 scoring functions using protein-protein interactions. a Precision for correct statements. b Recall for correct statements. c Classification Accuracy. d ROC AUC. e ROC K = 1. f ROC K = 2. g ROC K = 3. h ROC K = 5. i ROC K = 10. j ROC K = 15. k ROC K = 20. l ROC K = 25. m ROC K = 30
Fig. 7
Performance comparison of SaBRA on different relations with different document support values for k=30. a Performance comparison for different document support values on gene-disease relations. b Distribution of gene-disease relations. c Performance comparison for different document support values on protein-protein interactions. d Distribution of protein-protein interactions
Fig. 8
Performance comparison of BARC for k=30. DT: Decision Tree-based classification. a Performance comparison on gene–disease relations. b Performance comparison on protein–protein interactions
Fig. 9
Feature analysis using MI. The higher the density color, the higher the MI value. a Gene-Disease relations. b Protein-Protein interactions
Fig. 10
Feature ablation analysis (k=30). a Performance on gene–disease relations. b Performance on protein–protein interactions
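
The caption of Fig. 4 describes computing similarities between the contexts of each occurrence of the two entities and then aggregating the matrix into a feature vector. The sketch below shows one way such a computation could look. It is an assumption-laden illustration, not the authors' implementation: the token-window contexts, the bag-of-words cosine similarity, the choice of max/min/mean as aggregates, and all function names are hypothetical.

```python
import math
from collections import Counter


def contexts(tokens, entity, window=3):
    """Bag-of-words context (up to `window` tokens each side) per occurrence of `entity`."""
    out = []
    for i, tok in enumerate(tokens):
        if tok == entity:
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            out.append(Counter(ctx))
    return out


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(docs, e1, e2, window=3):
    """Rows: occurrences of e1 across docs; columns: occurrences of e2."""
    c1 = [c for d in docs for c in contexts(d, e1, window)]
    c2 = [c for d in docs for c in contexts(d, e2, window)]
    return [[cosine(x, y) for y in c2] for x in c1]


def aggregate(matrix):
    """Assumed aggregates: [max, min, mean] over all matrix cells."""
    values = [v for row in matrix for v in row]
    if not values:
        return [0.0, 0.0, 0.0]
    return [max(values), min(values), sum(values) / len(values)]


# Hypothetical tokenised documents standing in for retrieved top documents.
top_docs = [
    "cftr mutations cause cf in patients".split(),
    "cf is linked to cftr defects".split(),
]
matrix = similarity_matrix(top_docs, "cftr", "cf")
features = aggregate(matrix)
print(len(matrix), len(matrix[0]), features)
```

Here each document contains one occurrence of each entity, so the matrix is 2 by 2; the resulting three-value vector is the kind of input a downstream classifier could consume.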

References

    1. Baxevanis AD, Bateman A. The importance of biological databases in biological discovery. Curr Protocol Bioinforma. 2015;50(1):1.
    2. Bateman A. Curators of the world unite: the international society of biocuration. Bioinformatics. 2010;26(8):991.
    3. NCBI Resource Coordinators. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2017;45(D1):D12–7.
    4. Poux S, Magrane M, Arighi CN, Bridge A, O’Donovan C, Laiho K, The UniProt Consortium. Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. Database. 2014;2014:bau016.
    5. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017;45(D1):D158–69.
