Generating Clinical-Grade Gene-Disease Validity Classifications Through the ClinGen Data Platforms

Matt W Wright¹, Courtney L Thaxton², Tristan Nelson³, Marina T DiStefano⁴, Juliann M Savatt³, Matthew H Brush⁵, Gloria Cheung¹, Mark E Mandell¹, Bryan Wulf¹, T J Ward², Scott Goehringer³, Terry O'Neill⁴, Phil Weller³, Christine G Preston¹, Ingrid M Keseler¹, Jennifer L Goldstein², Natasha T Strande³, Jennifer McGlaughon², Danielle R Azzariti⁴, Ineke Cordova³, Hannah Dziadzio⁴, Lawrence Babb⁴, Kevin Riehle⁶, Aleksandar Milosavljevic⁶, Christa Lese Martin³, Heidi L Rehm⁴, Sharon E Plon⁷,⁶, Jonathan S Berg², Erin R Riggs³, Teri E Klein⁸,¹

Affiliations

¹ Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, California, USA; email: wrightmw@stanford.edu, teri.klein@stanford.edu.
² Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA; email: courtney_thaxton@med.unc.edu.
³ Geisinger, Danville, Pennsylvania, USA; email: thnelson@geisinger.edu.
⁴ Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.
⁵ Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, Colorado, USA.
⁶ Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA.
⁷ Department of Pediatrics, Division of Hematology-Oncology, Baylor College of Medicine, Houston, Texas, USA.
⁸ Departments of Medicine (Biomedical Informatics Research) and Genetics, Stanford University School of Medicine, Stanford, California, USA.

PMID: 38663031
PMCID: PMC12001867
DOI: 10.1146/annurev-biodatasci-102423-112456

Review

Matt W Wright et al. Annu Rev Biomed Data Sci. 2024 Aug;7(1):31-50. doi: 10.1146/annurev-biodatasci-102423-112456. Epub 2024 Jul 24.

Abstract

Clinical genetic laboratories must have access to clinically validated biomedical data for precision medicine. A lack of accessibility, normalized structure, and consistency in evaluation complicates interpretation of disease causality, resulting in confusion in assessing the clinical validity of genes and genetic variants for diagnosis. A key goal of the Clinical Genome Resource (ClinGen) is to fill the knowledge gap concerning the strength of evidence supporting the role of a gene in a monogenic disease, which is achieved through a process known as Gene-Disease Validity curation. Here we review the work of ClinGen in developing a curation infrastructure that supports the standardization, harmonization, and dissemination of Gene-Disease Validity data through the creation of frameworks and the utilization of common data standards. This infrastructure is based on several applications, including the ClinGen GeneTracker, Gene Curation Interface, Data Exchange, GeneGraph, and website.

Keywords: biocuration; clinical genetics; data harmonization; data standards; precision medicine; research informatics.

Figures

**Figure 1**
Increasing genomic data influx due to advancements in knowledge bases and technologies. Over time the development of several knowledge bases (e.g., OMIM, Mondo, *GeneReviews*) and/or technologies (e.g., polymerase chain reaction, microarray, gene panels) aimed toward evaluating and discovering genes associated with disease resulted in an ever-increasing amount of data. With this growing knowledge, ClinGen was launched in order to develop strategies to evaluate the clinical validity of gene–disease relationships and sort through much of the data that were generated prior to its establishment. This image represents a small fraction of the genomic knowledge bases and technologies that contributed to the influx of data. Abbreviations: ACMG, American College of Medical Genetics and Genomics; ClinGen, Clinical Genome Resource; DDD, Deciphering Developmental Disorders; GTR, Genetic Testing Registry; GWAS, genome-wide association study; HGNC, Human Genome Organization Gene Nomenclature Committee; ICD, International Classification of Diseases; *MIM*, *Mendelian Inheritance in Man*; Mondo, Monarch Disease Ontology; OMIM, Online Mendelian Inheritance in Man; PCR, polymerase chain reaction; RT-PCR, reverse transcription polymerase chain reaction.

**Figure 2**
ClinGen’s Gene–Disease Validity curation workflow and supporting infrastructure. During the course of curation, publication, and dissemination, a ClinGen Gene–Disease Validity curation passes through multiple systems. A curation is initiated in GeneTracker, where the appropriate gene, disease, and mode of inheritance; the correct expert panel to perform the curation; and an initial review of the relevant literature are recorded. This information is passed to the ClinGen GCI, where details about the evidence are recorded, structured, and scored according to the current SOP, based on a comprehensive review of the literature. This work is reviewed according to the policies of the expert panel performing the work, and a final classification is agreed upon and approved in the GCI. Following approval the curation is sent to GeneGraph via the ClinGen Data Exchange, an Apache Kafka–based messaging system, which facilitates durable and auditable transfers of data between systems. In GeneGraph the data received are transformed from the format used to support the user interface of the GCI into a structured, normalized format based on the Scientific Evidence and Provenance Information Ontology model. A queryable application programming interface based on GraphQL reflecting this model is presented to the ClinGen website, where curation data are viewable by the public. Currently summary curation data are available for download via the website; a goal is to present full, structured, computable data for download by resources in the community. Abbreviations: GA4GH, Global Alliance for Genomics and Health; GCI, Gene Curation Interface; GenCC, Gene Curation Coalition; OMIM, Online Mendelian Inheritance in Man; SOP, standard operating procedure; UCSC, University of California Santa Cruz Genome Browser.

**Figure 3**
Example of a Calculated Classification Matrix in the Gene Curation Interface (GCI). This matrix shows the evidence scores automatically tabulated for the *PEX19* (HGNC:9713)/peroxisome biogenesis disorder (MONDO:0019234)/autosomal recessive inheritance (HP:0000007) gene–disease–mode of inheritance within the GCI, as curated by the ClinGen Peroxisomal Disorders Gene Curation Expert Panel (35). Points have been capped in three places: (①) For variant evidence the 13.2 total points have been capped to 12 points counted, (②) for functional evidence the 3 total points have been capped to 2 points counted, and (③) for all genetic evidence the 13 total points have been capped to 12 points counted. A version of this matrix is replicated in the final classification published on the ClinGen website: https://search.clinicalgenome.org/kb/genes/HGNC:9713. Abbreviation: LOD, logarithm of the odds.
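
The capping arithmetic described in the Figure 3 caption can be illustrated with a minimal Python sketch. The function and variable names are hypothetical and are not taken from the GCI; only the totals and caps quoted in the caption are used.

```python
from typing import NamedTuple


class CappedScore(NamedTuple):
    total: float    # points tallied before any cap
    counted: float  # points that count toward the classification
    capped: bool    # True when the cap reduced the total


def apply_cap(total: float, maximum: float) -> CappedScore:
    """Limit a tallied score to the maximum allowed for its category."""
    counted = min(total, maximum)
    return CappedScore(total, counted, counted < total)


# (total points, cap) pairs as stated in the Figure 3 caption.
FIGURE_3_CAPS = {
    "variant evidence": (13.2, 12),
    "functional evidence": (3, 2),
    "all genetic evidence": (13, 12),
}

results = {name: apply_cap(total, cap) for name, (total, cap) in FIGURE_3_CAPS.items()}

for name, score in results.items():
    print(f"{name}: {score.total} total, {score.counted} counted, capped={score.capped}")
```

Keeping both the total and the counted value mirrors the caption, which reports the points tallied alongside the points actually counted.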

**Figure 4**
Example of data structure. When ClinGen Gene–Disease Validity curations are processed by GeneGraph, the data structure is transformed to align with the Scientific Evidence and Provenance Information Ontology data model. The Gene Curation Interface generates durable universally unique identifiers for most entities created during curation; GeneGraph leverages these and generates durable identifiers of its own so that every aspect of the curation is structured in a way that can be leveraged in many different ways by downstream systems. The structure uses an ontological foundation for data types, descriptive elements, and relationships, allowing ClinGen curations to be merged with other data by leveraging semantic web technologies. Similar concepts are represented in a consistent way; for example, the score a curator applies to a given piece of evidence has the same data model regardless of the type of evidence being scored. This truncated example of the data structure for curation scoring offers a sense of how the reasoning supporting a curation is represented using structured data. The data represented are from the same *PEX19*/peroxisome biogenesis disorder/autosomal recessive inheritance gene–disease–mode of inheritance used in Figure 3. A complete data structure can be viewed in Supplemental Figure 2.
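
As a rough illustration of the statement that a score has the same data model regardless of the type of evidence being scored, the following Python sketch uses one record shape for every score. The class name, field names, and point values are placeholders assumed for illustration; they are not the Scientific Evidence and Provenance Information Ontology terms, and they are not the identifiers or scores that the GCI or GeneGraph actually produce.

```python
from dataclasses import dataclass, field, fields
from uuid import uuid4


@dataclass(frozen=True)
class EvidenceScore:
    """Hypothetical uniform score record shared by all evidence types."""
    evidence_type: str  # e.g. "variant evidence" or "functional evidence"
    points: float       # placeholder value, not a real curated score
    # A UUID assigned at creation; it stays durable only if it is stored
    # and reused, which this sketch does not attempt to show.
    score_id: str = field(default_factory=lambda: str(uuid4()))


variant_score = EvidenceScore("variant evidence", 0.5)
functional_score = EvidenceScore("functional evidence", 1.0)

# Both records expose exactly the same fields, whatever the evidence type.
field_names = [f.name for f in fields(EvidenceScore)]
print(field_names)
```

Because every score shares one shape, downstream code can total or filter scores without branching on evidence type.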

**Figure 5**
Gene–Disease Validity curation lifecycle. The Gene–Disease Validity curation lifecycle begins with biocurators (curators) collecting and annotating data (aggregate), then applying standards and frameworks (curate), with the ultimate goal of publishing records for use by the clinical community (disseminate). Many Gene–Disease Validity curations are expected to need updating as knowledge about the gene in relation to the disease grows; this process is termed recuration. Recuration is especially necessary for gene–disease relationships classified as Moderate, Limited, or Disputed.

References

1. Amberger JS, Bocchini CA, Scott AF, Hamosh A. 2019. OMIM.org: leveraging knowledge across phenotype–gene relationships. Nucleic Acids Res. 47:D1038–43
2. Povey S, Lovering R, Bruford E, Wright M, Lush M, Wain H. 2001. The HUGO Gene Nomenclature Committee (HGNC). Hum. Genet. 109:678–80
3. Claussnitzer M, Cho JH, Collins R, Cox NJ, Dermitzakis ET, et al. 2020. A brief history of human disease genetics. Nature 577:179–89
4. Crespi S. 2021. Looking back at 20 years of human genome sequencing. Science Podcast, Feb. 4. https://www.science.org/content/podcast/looking-back-20-years-human-geno...
5. Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, et al. 2015. ClinGen—the Clinical Genome Resource. N. Engl. J. Med. 372:2235–42
