Meta-Analysis
Nat Genet. 2020 Apr;52(4):448-457.
doi: 10.1038/s41588-020-0603-8. Epub 2020 Apr 3.

A harmonized meta-knowledgebase of clinical interpretations of somatic genomic variants in cancer

Alex H Wagner et al. Nat Genet. 2020 Apr.

Abstract

Precision oncology relies on accurate discovery and interpretation of genomic variants, enabling individualized diagnosis, prognosis and therapy selection. We found that six prominent somatic cancer variant knowledgebases were highly disparate in content, structure and supporting primary literature, impeding consensus when evaluating variants and their relevance in a clinical setting. We developed a framework for harmonizing variant interpretations to produce a meta-knowledgebase of 12,856 aggregate interpretations. We demonstrated large gains in overlap between resources across variants, diseases and drugs as a result of this harmonization. We subsequently demonstrated improved matching between a patient cohort and harmonized interpretations of potential clinical significance, observing an increase from an average of 33% per individual knowledgebase to 57% in aggregate. Our analyses illuminate the need for open, interoperable sharing of variant interpretation data. We also provide a freely available web interface (search.cancervariants.org) for exploring the harmonized interpretations from these six knowledgebases.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Creation of a harmonized meta-knowledgebase.
Six variant interpretation knowledgebases of the VICC (top panel) and representative symbolic interpretations from each (colored columns) are illustrated. Interpretations are split across five different elements: gene, variant, disease, drugs and evidence, and are colored to indicate their originating knowledgebase. Reference-linked elements correspond to unique identifiers from established authorities for that element (for example, the use of Entrez or Ensembl gene identifiers). Standardized elements correspond to immediately recognizable formats or descriptions of elements, but are not linked to an authoritative definition. Resource-specific elements are described by terminology unique to the knowledgebase. These elements are each harmonized (bottom left panel) to a common reference standard (shown here is the use of HGNC for genes, ChEMBL for drugs, AMP/ASCO/CAP guidelines for evidence, Disease Ontology for diseases and ClinGen Allele Registry for variants). This harmonized meta-knowledgebase allows for querying across interpretations from each of the constituent VICC knowledgebases (bottom right panel, example query BRAF V600E), returning aggregated results, which are categorized and sorted by evidence level.
Fig. 2. Representation of genomic variants across interpretation knowledgebases.
a, UpSet plot of variants across six cancer variant interpretation knowledgebases (KBs). Sets of variant interpretation knowledgebases with shared variants are indicated by colored dots in the lower panel, with color indicating set size (for example, yellow dots indicate only the single designated knowledgebase in the set, green dots indicate two knowledgebases in the set, etc.). Objects are attributed to the largest containing set; thus, a variant described by all six knowledgebases is attributed to the dark blue set with eight variants. b, Pie chart visualizing overall uniqueness of variants, with categories indicating the number of knowledgebases describing each variant. Nearly 77% of variants are unique across the knowledgebases, with only 0.2% ubiquitously represented. The eight variants present in all six knowledgebases are listed on the right. c, A comparison of element uniqueness across knowledgebases. Despite having the greatest degree of overlap across all elements, approximately 61% of genes are unique across the knowledgebases. Literature cited to support interpretations has the smallest degree of overlap across all elements, with 83% of publications remaining unique across the knowledgebases. *Drugs are not evaluated for PMKB, which does not formally represent this concept. d, Multiple syntactically valid representations of an identical protein product can lead to confusion in describing the change in the literature and in variant databases. The wild-type protein sequence (dark blue with orange lettering) is represented for ERBB2 (top). Two (of many) possible representations of an inframe insertion (orange with dark blue lettering) are shown (bottom). A nonstandard HGVS expression describes a five-amino-acid insertion replacing one glutamate residue (middle). At the bottom, the HGVS standard representation shows an identical protein product from a four-amino-acid duplication. A search for one representation against a database with another (nonoverlapping) representation may lead to omission of a clinically relevant finding.
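The attribution rule in panel a (each variant counted once, under the exact set of knowledgebases that describe it) and the uniqueness breakdown in panel b can be sketched as follows. This is a minimal illustration on toy data, not the authors' analysis code; the knowledgebase names and variant keys are placeholders.

```python
from collections import Counter


def exclusive_set_counts(kb_to_variants):
    """Count variants by the exact set of knowledgebases describing them."""
    membership = {}
    for kb, variants in kb_to_variants.items():
        for variant in variants:
            membership.setdefault(variant, set()).add(kb)
    # Each variant lands in exactly one bucket: its largest containing set.
    return Counter(frozenset(kbs) for kbs in membership.values())


def uniqueness_profile(kb_to_variants):
    """Fraction of variants described by exactly n knowledgebases."""
    counts = exclusive_set_counts(kb_to_variants)
    total = sum(counts.values())
    by_size = Counter()
    for kbs, n in counts.items():
        by_size[len(kbs)] += n
    return {size: n / total for size, n in sorted(by_size.items())}


# Toy input (placeholder names, not data from the paper).
toy = {
    "KB_A": {"v1", "v2", "v3"},
    "KB_B": {"v1", "v4"},
    "KB_C": {"v1", "v2"},
}
counts = exclusive_set_counts(toy)
profile = uniqueness_profile(toy)
```

In this toy input, `v1` is counted only under the three-knowledgebase set and not again under any pair or single knowledgebase, which mirrors how the eight ubiquitous variants are attributed in panel a.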
Fig. 3. Clinical interpretations of variants are defined by disease.
a–c, Core dataset interpretations for top-level disease groups. Distinct diseases are shown if the constituent interpretations for that disease account for at least 5% of the total dataset (a). Diseases accounting for at least 5% of cancer incidence (b) and mortality (c) are also displayed. Approximately 8% of interpretations are categorized as benign neoplasms (dark gray; for example, von Hippel–Lindau disease). An additional 1% are categorized under high-level terms other than DOID:14566, disease of cellular proliferation. d,e, Heat map of frequent gene–disease interpretations (d) and the related heat map limited to tier I interpretations (e). f, Percentage of Project GENIE cohort with at least one interpretation from the indicated knowledgebase that matches patient variants (left group), patient variants and disease (center group) or patient variants, disease and a tier I evidence level (right group). A broader search strategy (indicated by whisker bars; Extended Data Fig. 4) that allows for regional variant matches (for example, gene level) and broader interpretation of disease terms (for example, DOID:162, cancer) nearly doubles the number of patients with matching interpretations. These broader match strategies are incompatible with the ASCO/AMP/CAP evidence guidelines. g, Most significant finding (by evidence level) across patient samples, by disease. Each column represents one of the common diseases indicated in a, and the rows represent the evidence levels described in Table 1. Inner, light green circles (labeled Singular) indicate the proportion observed when matching patient diseases to interpretations with the same disease ontology term. Outer, dark green circles (labeled Grouped) indicate the proportion observed when matching patients to interpretations with ancestor or descendant terms that group to the same class of disease (Methods). Hem. cancer, hematological cancer; Lrg. int. cancer, large intestine cancer.
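The cohort statistic in panel f (the share of patients with at least one matching interpretation, per knowledgebase and in aggregate) can be sketched as below. This is a minimal illustration assuming variants are already reduced to comparable keys; the function name, patient identifiers and variant keys are hypothetical, and the numbers are toy values rather than results from the paper.

```python
def match_rates(patient_variants, kb_variants):
    """Return (per-knowledgebase match rate, aggregate match rate).

    patient_variants: dict of patient id -> set of variant keys
    kb_variants: dict of knowledgebase name -> set of interpreted variant keys
    """
    n = len(patient_variants)
    if n == 0:
        raise ValueError("cohort is empty")
    # A patient counts as matched if any of their variants is interpreted.
    per_kb = {
        kb: sum(1 for vs in patient_variants.values() if vs & interpreted) / n
        for kb, interpreted in kb_variants.items()
    }
    # The aggregate rate uses the union of all knowledgebases.
    union = set().union(*kb_variants.values())
    aggregate = sum(1 for vs in patient_variants.values() if vs & union) / n
    return per_kb, aggregate


# Toy cohort (placeholder keys, not Project GENIE data).
patients = {"p1": {"v1"}, "p2": {"v2"}, "p3": {"v3"}, "p4": {"v1", "v3"}}
kbs = {"KB_A": {"v1"}, "KB_B": {"v2"}}
per_kb, aggregate = match_rates(patients, kbs)
```

Because the aggregate rate is computed over the union, it can never fall below the best single knowledgebase, which is the mechanism behind the gain the abstract reports for the harmonized set.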
Fig. 4. A web client for exploring the VICC meta-knowledgebase.
a, Queries are entered as individual terms, with compound queries (for example, BRAF and V600E) denoted by preceding ‘+’ characters. Usage help and example documentation can be found by clicking the ‘?’ icon. b, Result visualization panels are interactive, allowing users to quickly filter results by evidence level, source, disease, drug and gene. c, Scrollable results table has sortable columns detailing each resource (for example, MolecularMatch), gene (BRAF), variant (V600E), disease (skin melanoma), drug (vemurafenib), evidence level, evidence direction, original URL and primary literature. Rows are expandable and include additional detail structure as both JavaScript object notation (JSON) and a table.
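One plausible reading of the query syntax in panel a is that '+'-prefixed terms must all be present in a result. The sketch below implements that reading over plain dictionaries; it is an assumption about the semantics, not the web client's implementation, and the function names and the second row are hypothetical.

```python
def parse_query(query):
    """Split a query into required ('+'-prefixed) and optional terms."""
    required, optional = [], []
    for token in query.split():
        term = token.lstrip("+").lower()
        if not term:
            continue  # skip a bare '+'
        (required if token.startswith("+") else optional).append(term)
    return required, optional


def record_matches(record, query):
    """Required terms must all match; otherwise any optional term suffices."""
    required, optional = parse_query(query)
    text = " ".join(str(value) for value in record.values()).lower()
    if required:
        return all(term in text for term in required)
    return any(term in text for term in optional)


# First row uses the example values from panel c; second row is a toy row.
ROWS = [
    {"source": "MolecularMatch", "gene": "BRAF", "variant": "V600E",
     "disease": "skin melanoma", "drug": "vemurafenib"},
    {"source": "toy_kb", "gene": "BRAF", "variant": "V600K"},
]
compound_hits = [r for r in ROWS if record_matches(r, "+BRAF +V600E")]
gene_hits = [r for r in ROWS if record_matches(r, "BRAF")]
```

Under this reading, the compound query keeps only the V600E row, while the single-term query returns both BRAF rows.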
Extended Data Fig. 1. Harvesting and harmonizing records.
Harvested interpretation records (left column) from each knowledgebase vary in structure, a consequence of how they are represented and exported by their parent knowledgebase. Knowledgebase-specific rules are written to select data from harvested records for harmonization across a suite of element-specific harmonizers (center column). Colors represent different elements of an interpretation, which are each harmonized independently: genes (green), variants (cyan), diseases (red), drugs (purple), and evidence (yellow). Outputs from these harmonizers are assembled into normalized records (right column).
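The three-column flow described here (harvested records, knowledgebase-specific selection rules, element-specific harmonizers) can be sketched with small lookup tables. Everything in this example is a stand-in: the source names `kb_one` and `kb_two`, the record layouts and the alias tables are hypothetical, and a real harmonizer would resolve terms against the reference authorities rather than a hard-coded dictionary.

```python
# Toy alias tables standing in for the gene and drug harmonizers.
GENE_ALIASES = {"HER2": "ERBB2", "ERBB2": "ERBB2", "BRAF": "BRAF"}
DRUG_ALIASES = {"zelboraf": "vemurafenib", "vemurafenib": "vemurafenib"}


def harmonize_gene(name):
    """Map a gene label to a common symbol, or None if unknown."""
    return GENE_ALIASES.get(name.strip().upper())


def harmonize_drug(name):
    """Map a drug label to a common name, or None if unknown."""
    return DRUG_ALIASES.get(name.strip().lower())


# Knowledgebase-specific rules: where each element lives in a raw record.
# Both layouts are invented for illustration.
EXTRACTORS = {
    "kb_one": lambda r: {"gene": r["gene_symbol"], "drug": r["therapy"]},
    "kb_two": lambda r: {"gene": r["feature"]["gene"], "drug": r["drugs"][0]},
}


def normalize(source, record):
    """Select fields with the source's rule, then harmonize each element."""
    fields = EXTRACTORS[source](record)
    return {
        "source": source,
        "gene": harmonize_gene(fields["gene"]),
        "drug": harmonize_drug(fields["drug"]),
    }


first = normalize("kb_one", {"gene_symbol": "braf", "therapy": "Zelboraf"})
second = normalize("kb_two", {"feature": {"gene": "BRAF "}, "drugs": ["vemurafenib"]})
```

Keeping the selection rules separate from the harmonizers means each element is normalized by one shared function regardless of which knowledgebase supplied it, so the two differently shaped records above end up with identical gene and drug values.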
Extended Data Fig. 2. Knowledgebase overlap.
a, UpSet plot of publications supporting clinical interpretations of variants. The overwhelming majority of publications are observed in only 1 of 6 resources. b, UpSet plot of genes described by clinical interpretations of variants. Compared to other interpretation elements, genes are much more commonly shared between resources.
Extended Data Fig. 3. Knowledgebase disease enrichment.
Relative distribution of interpretations describing diseases across the VICC resources. Several resources are strongly enriched for one or more diseases compared to the entire dataset (see related Supplementary Table 8).
Extended Data Fig. 4. Search strategies.
a, A variant intersection search strategy. Variants that match at position and allele are referred to as “exact” (blue box), variants matching at position only as “positional” (green box), variants that largely (but not completely) intersect are considered “focal” (orange box), and variants that overlap only a small amount are considered “regional” (red box). The left column shows matched results for a query (search box, top), based on the intersection of coordinates in the right column. b, TopNode disease search strategy. Shown are a subset of disease nodes that all map to the parent TopNode DOID:1612, ‘Breast Cancer’. A query for DOID:3007 would return 44 interpretations (blue) from the queried term, its direct ancestors (DOID:3459, ‘Breast Carcinoma’ and DOID:1612, ‘Breast Cancer’) and descendants (DOID:3008, ‘invasive ductal carcinoma’), but no interpretations (red) from indirectly related terms (DOID:0050938, ‘breast lobular carcinoma’ and DOID:3457, ‘invasive lobular carcinoma’).
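Both strategies can be sketched in a few lines. The caption gives no numeric boundary between "focal" and "regional", so the cutoff below (overlap of at least half the combined span) is an assumption, as are the class and function names and the toy coordinates. The disease hierarchy uses the DOID terms named in panel b; the parent links for the two lobular carcinoma terms are assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # inclusive
    alt: str


FOCAL_MIN_OVERLAP = 0.5  # assumed cutoff, not stated in the caption


def match_type(query, target):
    """Classify a match as exact, positional, focal or regional (or None)."""
    if query.chrom != target.chrom:
        return None
    overlap = min(query.end, target.end) - max(query.start, target.start) + 1
    if overlap <= 0:
        return None
    if (query.start, query.end) == (target.start, target.end):
        return "exact" if query.alt == target.alt else "positional"
    span = max(query.end, target.end) - min(query.start, target.start) + 1
    return "focal" if overlap / span >= FOCAL_MIN_OVERLAP else "regional"


# Child -> parent links among the terms named in panel b.
PARENT = {
    "DOID:3459": "DOID:1612",
    "DOID:3007": "DOID:3459",
    "DOID:3008": "DOID:3007",
    "DOID:0050938": "DOID:3459",  # assumed parent
    "DOID:3457": "DOID:0050938",  # assumed parent
}


def ancestors(term):
    """Walk parent links up to the top node."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def related_terms(term):
    """The queried term plus its direct ancestors and its descendants."""
    terms = set(PARENT) | set(PARENT.values())
    descendants = {t for t in terms if term in ancestors(t)}
    return {term, *ancestors(term), *descendants}


query = Variant("7", 100, 100, "T")  # toy coordinates
related = related_terms("DOID:3007")
```

With this hierarchy, a query for DOID:3007 reaches DOID:3459, DOID:1612 and DOID:3008 but neither lobular carcinoma term, matching the blue and red nodes described in panel b.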
Extended Data Fig. 5. Commonality of observed mutations and their interpretations.
Interpretation count (x-axis) by number of queries (y-axis). Focal (yellow) and positional (green) searches provide a benefit to interpretability over exact matching. Notably, several high interpretation spikes are observed, due to variants that have both a large number of interpretations and are often observed in the GENIE cohort. These include KRAS G12 mutations, BRAF V600E, and several mutations in PIK3CA.
Extended Data Fig. 6. Gene intersection search.
Percentage of Project GENIE cohort with at least one interpretation from the indicated knowledgebase that matches patient variant genes (left group), patient variant genes and disease (center group), or patient variant genes, disease, and a Tier I evidence level (right group). This very broad match strategy is incompatible with the ASCO/AMP/CAP evidence guidelines.
