Database (Oxford). 2023 Sep 27:2023:baad061.
doi: 10.1093/database/baad061.

Building a large gene expression-cancer knowledge base with limited human annotations

Stefano Marchesin et al. Database (Oxford). 2023.

Abstract

Cancer prevention is one of the most pressing challenges that public health needs to face. In this regard, data-driven research is central to assist medical solutions targeting cancer. To fully harness the power of data-driven research, it is imperative to organize machine-readable facts into a well-structured knowledge base (KB). Motivated by this urgent need, we introduce the Collaborative Oriented Relation Extraction (CORE) system for building KBs with limited manual annotations. CORE combines the distant supervision and active learning paradigms and offers a seamless, transparent, modular architecture equipped for large-scale processing. We focus on precision medicine and build the largest KB on 'fine-grained' gene expression-cancer associations, a key resource to complement and validate experimental data for cancer research. We show the robustness of CORE and discuss the usefulness of the provided KB. Database URL: https://zenodo.org/record/7577127

Figures

Figure 1.
Overview of the CORE architecture. The system consists of five main modules and three processes. The modules represent the data acquisition and NERD components (1), the manual annotation activities (2), the training of the RE models (3), the subsequent automatic annotation (4), and the KB population (5). The processes reflect the different workflows: bootstrapping (orange) sets up the KBC process via expert involvement; deployment (blue) scales it through automated RE methods; and active learning (purple) allows refining the process through subsequent iterations.
Figure 2.
Detailed view of the CORE architecture. In module (1), CORE acquires text from biomedical literature and then performs NERD to generate entity-annotated sentences. These sentences are then manually annotated by experts in module (2) to produce relation-annotated sentences, which are used to generate the datasets for training RE methods in module (3). Once trained, in module (4), the RE methods are deployed over entity-annotated sentences to automatically generate relation-annotated sentences. Finally, in module (5), relation-annotated sentences undergo a knowledge enrichment component, which generates facts, and a reliability testing component, which tags facts as ‘reliable’ or ‘unreliable’. Facts tagged as ‘reliable’ are used to populate the KB, whereas ‘unreliable’ facts are returned to experts for re-annotation.
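The routing step in module (5) can be illustrated with a minimal Python sketch. The caption does not say how reliability is tested, so the criterion used here (a minimum number of supporting sentences plus a majority-agreement threshold) is purely an assumption, as are the names `RelationSentence`, `aggregate_facts`, `route_facts`, `min_support` and `min_agreement`.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationSentence:
    """Hypothetical shape of a relation-annotated sentence."""
    gene: str
    relation: str
    disease: str


def aggregate_facts(sentences):
    """Stand-in for knowledge enrichment: count relation labels per (gene, disease) pair."""
    votes = defaultdict(Counter)
    for s in sentences:
        votes[(s.gene, s.disease)][s.relation] += 1
    return votes


def route_facts(votes, min_support=2, min_agreement=0.6):
    """Split candidate facts into KB entries and facts sent back for re-annotation.

    The thresholds are illustrative assumptions, not CORE's actual reliability test.
    """
    kb, review = [], []
    for (gene, disease), counts in votes.items():
        relation, top = counts.most_common(1)[0]
        total = sum(counts.values())
        fact = (gene, relation, disease)
        if total >= min_support and top / total >= min_agreement:
            kb.append(fact)      # tagged 'reliable': populates the KB
        else:
            review.append(fact)  # tagged 'unreliable': returned to experts
    return kb, review


# Toy input, invented for illustration only.
sentences = [
    RelationSentence("AKT1", "ONCOGENE", "Mammary Neoplasms"),
    RelationSentence("AKT1", "ONCOGENE", "Mammary Neoplasms"),
    RelationSentence("AKT1", "ONCOGENE", "Mammary Neoplasms"),
    RelationSentence("TP53", "BIOMARKER", "Colorectal Neoplasms"),
    RelationSentence("TP53", "TUMOR_SUPPRESSOR", "Colorectal Neoplasms"),
]
kb, review = route_facts(aggregate_facts(sentences))
```

With this toy input, the AKT1 fact has consistent support and goes to `kb`, while the TP53 pair has conflicting labels and lands in `review`.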
Figure 3.
The ten most involved genes (and their roles) in cancer diseases. From left to right, the figures present the ten most involved oncogenes, biomarkers and tumor suppressor genes, respectively. AKT1 is the most prominent oncogene, with wide expression in various tissues. Other known oncogenes include MAPK1, MAPK3 and STAT3. Proto-oncogenes such as ERBB2, EGFR and BCL2 show altered expression levels in cancer, but lack sufficient evidence to be identified as oncogenes, thus fitting our definition of biomarkers. TP53 represents an interesting case, as it functions as a biomarker and a tumor suppressor gene for several diseases, with its classification evolving over time.
Figure 4.
The ten most discussed genes, cancer diseases, and facts within the literature. The most discussed genes are those most involved in cancer diseases, with a focus on breast, colorectal, prostate, and lung cancer—i.e., the most common cancer types worldwide. Consequently, the most discussed facts refer to gene expression-cancer associations involving these specific genes and diseases.
Figure 5.
Temporal progression of publications concerning the longest-discussed fact in literature: (ERBB2, BIOMARKER, Mammary Neoplasms). ERBB2 is a known proto-oncogene, amplified or overexpressed in around 30% of human breast cancers (73). Its relevance in breast cancer justifies the prominent presence of the corresponding fact in the scientific discourse.
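A temporal progression like this one could be computed by counting, per year, the distinct publications that support a given (gene, relation, disease) fact. The sketch below assumes a hypothetical record layout of `(publication_id, year, fact)`; the function name and the toy records are illustrative and are not taken from the KB.

```python
from collections import Counter


def publications_per_year(mentions, fact):
    """Count distinct publications per year that support `fact`.

    `mentions` is an iterable of (publication_id, year, fact) tuples;
    this layout is an assumption made for the example.
    """
    seen = set()
    per_year = Counter()
    for pub_id, year, triple in mentions:
        if triple == fact and pub_id not in seen:
            seen.add(pub_id)  # a publication counts once, even with several sentences
            per_year[year] += 1
    return dict(sorted(per_year.items()))


# Placeholder records, invented for illustration only.
FACT = ("ERBB2", "BIOMARKER", "Mammary Neoplasms")
toy_mentions = [
    ("pub-a", 2001, FACT),
    ("pub-a", 2001, FACT),  # second sentence from the same publication
    ("pub-b", 2001, FACT),
    ("pub-c", 2003, FACT),
    ("pub-d", 2003, ("AKT1", "ONCOGENE", "Mammary Neoplasms")),
]
timeline = publications_per_year(toy_mentions, FACT)
```

Deduplicating on the publication identifier keeps a paper with many supporting sentences from inflating its year's count.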
Figure 6.
COREKB Search Engine Result Page: first result for the query ‘AKT1 oncogene mammary neoplasms’. The retrieved facts are organized as cards providing information on (A) the gene, the cancer and their relationship and (B) the entities involved in the association, i.e. the gene and the related cancer expression. In addition, card (A) includes infometrics and bibliometrics information to provide further insights. The contents of the cards can be downloaded in JSON format through the dedicated download button.

References

    1. Manzoni C., Kia D.A., Vandrovcova J. et al. (2016) Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Brief. Bioinformatics, 19, 286–302.
    2. Borry P., Bentzen H.B., Budin-Ljøsne I. et al. (2018) The challenges of the expanded availability of genomic information: an agenda-setting paper. J. Community Genet., 9, 103–116.
    3. Neary B., Zhou J. and Qiu P. (2021) Identifying gene expression patterns associated with drug-specific survival in cancer patients. Sci. Rep., 11, 1–12.
    4. Dugger S., Platt A. and Goldstein D. (2018) Drug development in the era of precision medicine. Nat. Rev. Drug. Discov., 17, 183–196.
    5. Li X. and Warner J.L. (2020) A review of precision oncology knowledgebases for determining the clinical actionability of genetic variants. Front. Cell Dev. Biol., 8, 1–48.