Building a large gene expression-cancer knowledge base with limited human annotations

Stefano Marchesin¹, Laura Menotti¹, Fabio Giachelle¹, Gianmaria Silvello¹, Omar Alonso²

Affiliations

¹ Department of Information Engineering, University of Padova, Via G. Gradenigo 6b, Padova 35131, Italy.
² Applied Science, Amazon, 3075 Olcott St., Santa Clara, California 95054, USA.

PMID: 37768281
PMCID: PMC10533344
DOI: 10.1093/database/baad061

Stefano Marchesin et al. Database (Oxford). 2023 Sep 27;2023:baad061.

Abstract

Cancer prevention is one of the most pressing challenges facing public health. In this regard, data-driven research is central to supporting medical solutions that target cancer. To fully harness the power of data-driven research, it is imperative to organize machine-readable facts into a well-structured knowledge base (KB). Motivated by this urgent need, we introduce the Collaborative Oriented Relation Extraction (CORE) system for building KBs with limited manual annotations. CORE combines the distant supervision and active learning paradigms and offers a seamless, transparent, modular architecture equipped for large-scale processing. We focus on precision medicine and build the largest KB on 'fine-grained' gene expression-cancer associations, a key resource to complement and validate experimental data for cancer research. We show the robustness of CORE and discuss the usefulness of the provided KB. Database URL: https://zenodo.org/record/7577127

Figures

**Figure 1.**
Overview of the CORE architecture. The system consists of five main modules and three processes. The modules represent the data acquisition and NERD components (1), the manual annotation activities (2), the training of the RE models (3), the subsequent automatic annotation (4), and the KB population (5). The processes reflect the different workflows: bootstrapping (orange) sets up the KBC process via expert involvement; deployment (blue) scales it through automated RE methods; and active learning (purple) allows refining the process through subsequent iterations.

**Figure 2.**
Detailed view of the CORE architecture. In module (1), CORE acquires text from biomedical literature and then performs NERD to generate entity-annotated sentences. These sentences are then manually annotated by experts in module (2) to produce relation-annotated sentences, which are used to generate the datasets for training RE methods in module (3). Once trained, in module (4), the RE methods are deployed over entity-annotated sentences to automatically generate relation-annotated sentences. Finally, in module (5), relation-annotated sentences undergo a knowledge enrichment component, which generates facts, and a reliability testing component, which tags facts as ‘reliable’ or ‘unreliable’. Facts tagged as ‘reliable’ are used to populate the KB, whereas ‘unreliable’ facts are returned to experts for re-annotation.
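The routing step in module (5) can be illustrated with a minimal Python sketch. It is not the CORE implementation: the `Fact` fields, the `support` count, the `route_facts` helper and the threshold-based reliability test are all hypothetical stand-ins chosen for illustration, and the caption does not say how reliability is actually decided.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Fact:
    """A gene expression-cancer fact (field names are illustrative)."""
    gene: str
    relation: str
    disease: str
    support: int  # hypothetical: number of sentences backing the fact


def route_facts(
    facts: Iterable[Fact],
    is_reliable: Callable[[Fact], bool],
) -> Tuple[List[Fact], List[Fact]]:
    """Split facts into those that populate the KB and those sent back to experts."""
    kb: List[Fact] = []
    review_queue: List[Fact] = []
    for fact in facts:
        if is_reliable(fact):
            kb.append(fact)            # tagged 'reliable': populate the KB
        else:
            review_queue.append(fact)  # tagged 'unreliable': re-annotation
    return kb, review_queue


# Made-up support counts; the threshold of 3 is an assumption, not a CORE setting.
candidates = [
    Fact("ERBB2", "BIOMARKER", "Mammary Neoplasms", support=12),
    Fact("AKT1", "ONCOGENE", "Mammary Neoplasms", support=1),
]
kb, review_queue = route_facts(candidates, lambda f: f.support >= 3)
print([f.gene for f in kb], [f.gene for f in review_queue])
```

Passing the reliability test as a function keeps the routing logic independent of how reliability is measured, so a different criterion can be swapped in without touching the loop.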

**Figure 3.**
The ten most involved genes (and their roles) in cancer diseases. From left to right, the figures present the ten most involved oncogenes, biomarkers and tumor suppressor genes, respectively. AKT1 is the most prominent oncogene, with wide expression in various tissues. Other known oncogenes include MAPK1, MAPK3 and STAT3. Proto-oncogenes such as ERBB2, EGFR and BCL2 show altered expression levels in cancer, but lack sufficient evidence to be identified as oncogenes, thus fitting our definition of biomarkers. TP53 represents an interesting case, as it functions as a biomarker and a tumor suppressor gene for several diseases, with its classification evolving over time.

**Figure 4.**
The ten most discussed genes, cancer diseases, and facts within the literature. The most discussed genes are those most involved in cancer diseases, with a focus on breast, colorectal, prostate, and lung cancer, i.e. the most common cancer types worldwide. Consequently, the most discussed facts refer to gene expression-cancer associations involving these specific genes and diseases.

**Figure 5.**
Temporal progression of publications concerning the longest-discussed fact in literature: (ERBB2, BIOMARKER, Mammary Neoplasms). ERBB2 is a known proto-oncogene, amplified or overexpressed in around 30% of human breast cancers (73). Its relevance in breast cancer justifies the prominent presence of the corresponding fact in the scientific discourse.

**Figure 6.**
First result on the COREKB Search Engine Result Page for the query ‘AKT1 oncogene mammary neoplasms’. The retrieved facts are organized as cards providing information about (A) the gene, the cancer and their relationship and (B) the entities involved in the association, i.e. the gene and the related cancer expression. In addition, card (A) includes infometric and bibliometric information to provide further insights. The contents of the cards can be downloaded in JSON format through the dedicated download button.

References

1. Manzoni C., Kia D.A., Vandrovcova J. et al. (2016) Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Brief. Bioinformatics, 19, 286–302.
2. Borry P., Bentzen H.B., Budin-Ljøsne I. et al. (2018) The challenges of the expanded availability of genomic information: an agenda-setting paper. J. Community Genet., 9, 103–116.
3. Neary B., Zhou J. and Qiu P. (2021) Identifying gene expression patterns associated with drug-specific survival in cancer patients. Sci. Rep., 11, 1–12.
4. Dugger S., Platt A. and Goldstein D. (2018) Drug development in the era of precision medicine. Nat. Rev. Drug. Discov., 17, 183–196.
5. Li X. and Warner J.L. (2020) A review of precision oncology knowledgebases for determining the clinical actionability of genetic variants. Front. Cell Dev. Biol., 8, 1–48.
