The GA4GH Variation Representation Specification: A computational framework for variation representation and federated identification

Alex H Wagner¹,²,³, Lawrence Babb⁴, Gil Alterovitz⁵,⁶, Michael Baudis⁷, Matthew Brush⁸, Daniel L Cameron⁹,¹⁰, Melissa Cline¹¹, Malachi Griffith¹², Obi L Griffith¹², Sarah E Hunt¹³, David Kreda¹⁴, Jennifer M Lee¹⁵, Stephanie Li¹⁶, Javier Lopez¹⁷, Eric Moyer¹⁸, Tristan Nelson¹⁹, Ronak Y Patel²⁰, Kevin Riehle²⁰, Peter N Robinson²¹, Shawn Rynearson²², Helen Schuilenburg¹³, Kirill Tsukanov¹³, Brian Walsh⁸, Melissa Konopko¹⁶, Heidi L Rehm⁴,²³, Andrew D Yates¹³, Robert R Freimuth²⁴, Reece K Hart⁴,²⁵

Affiliations

¹ Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH 43210, USA.
² The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH 43215, USA.
³ Lead Contact.
⁴ Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
⁵ Harvard Medical School, Boston, MA 02115, USA.
⁶ Department of Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA.
⁷ University of Zurich and Swiss Institute of Bioinformatics, Zurich, Switzerland.
⁸ Oregon Health & Science University, Portland, OR 97239, USA.
⁹ Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia.
¹⁰ Department of Medical Biology, University of Melbourne, Melbourne, VIC, Australia.
¹¹ UC Santa Cruz Genomics Institute, Santa Cruz, CA 95060, USA.
¹² Washington University School of Medicine, St. Louis, MO 63108, USA.
¹³ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
¹⁴ Department of Biomedical Informatics, Harvard Medical School, Boston MA 02115, USA.
¹⁵ Essex Management LLC and National Cancer Institute, Rockville, MD 20850, USA.
¹⁶ The Global Alliance for Genomics and Health, Toronto, ON, Canada.
¹⁷ Genomics England, London EC1M 6BQ, UK.
¹⁸ National Center for Biotechnology Information, National Library of Medicine National Institutes of Health, Bethesda, MD 20894, USA.
¹⁹ Geisinger Health, Danville, PA 17822, USA.
²⁰ Baylor College of Medicine, Houston, TX 77030, USA.
²¹ Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA.
²² Utah Center for Genetic Discovery, University of Utah, Salt Lake City, UT 84112, USA.
²³ Center for Genomic Medicine, Massachusetts General Hospital, Cambridge, MA 02142, USA.
²⁴ Center for Individualized Medicine, Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55905, USA.
²⁵ MyOme, Inc., Menlo Park, CA 94070, USA.

PMID: 35311178
PMCID: PMC8929418
DOI: 10.1016/j.xgen.2021.100027

Alex H Wagner et al. Cell Genom. 2021 Nov 10;1(2):100027. doi: 10.1016/j.xgen.2021.100027.

Abstract

Maximizing the personal, public, research, and clinical value of genomic information will require the reliable exchange of genetic variation data. We report here the Variation Representation Specification (VRS, pronounced "verse"), an extensible framework for the computable representation of variation that complements contemporary human-readable and flat file standards for genomic variation representation. VRS provides semantically precise representations of variation and leverages this design to enable federated identification of biomolecular variation with globally consistent and unique computed identifiers. The VRS framework includes a terminology and information model, machine-readable schema, data sharing conventions, and a reference implementation, each of which is intended to be broadly useful and freely available for community use. VRS was developed by a partnership among national information resource providers, public initiatives, and diagnostic testing laboratories under the auspices of the Global Alliance for Genomics and Health (GA4GH).

Conflict of interest statement

Declaration of interests: H.L.R. is a member of the advisory board for Cell Genomics.

Figures

**Figure 1**
Components and extensions of the Variation Representation Specification. VRS is a specification comprising multiple layered components (solid-bordered blue boxes), each of which serves as a foundation for the components above it. While VRS provides a full end-to-end framework, each component of the framework can be extended by the community into alternate forms (dashed-bordered gray boxes). For example, VRS provides a JSON Schema implementation based upon the Terminology and Information Model, but that same Terminology and Information Model may be used to build schemas in DTD, XSD, Google Protocol Buffers, Apache Thrift, or other data validation frameworks. This modular construction of VRS encourages interoperability across many scenarios and communities.

**Figure 2**
Information model. The VRS information model consists of several interdependent data classes, including both concrete classes and abstract superclasses (indicated by the ≪abst≫ stereotype in the header). These classes may be broadly categorized as conceptual representations of Variation (green boxes), Feature (blue boxes), Location (light blue boxes), Sequence Expression (purple boxes), and General Purpose Types (gray boxes). The general purpose types support the primary classes, including intervals, ranges, Number, and GA4GH Sequence strings (not shown). While all VRS objects are value objects, only some objects are intended to be identifiable (Variation, Location, and Sequence). Conceptual inheritance relationships between classes are indicated by connecting lines.

**Figure 3**
VRS conventions. VRS provides many conventions to precisely describe and normalize molecular variation. (A) A key difference between VRS and other genomic variant formats such as HGVS and VCF is the use of inter-residue coordinates. In this example, the same residue coordinates (gray shading) are used to ambiguously describe the space between two nucleotides where an insertion occurs (top), or the space including the two nucleotides for deletions and substitutions (bottom). Inter-residue coordinates (blue shading) allow for precise representation of nucleotide or inter-nucleotide position without requiring knowledge of the operation, decoupling location representation from the representation of variation. (B) Here, a three-nucleotide insertion (GCA) occurs in a repetitive region, creating ambiguity as to where the true event (first row) actually occurred. Three systems for describing this variant are depicted. In HGVS (second row), the 3′-most position is selected to represent the insertion. An alternative HGVS representation has the 3′-most position define the repeat unit (here, “AGC”), then the variation is described by the number of repeated units from the first nucleotide of the 5′-most unit in the reference sequence. In VCF (third row), the leftmost insertion point is selected and an “anchor base” is prepended to describe the insertion. In contrast to these other systems, VRS (fourth row) avoids the selection of an arbitrary over-precise representation and instead uses a full-justification representation that covers the entire region of ambiguity. (C) Full-justification Allele normalization is enabled by a specified normalization algorithm. In this example, the unnormalized Allele “reference” and “alternate” sequences (step 0) are trimmed of their common suffix “CA” (step 1). Only the resulting “reference” sequence is blank, indicating this is an insertion, and the algorithm continues (step 2). The non-blank “alternate” sequence is incrementally rolled left to identify the left bound of matching repetitive sequence, then incrementally rolled right to identify the right bound (step 3). These boundaries are used to prepend (step 4, green sequence) and append (step 4, orange sequence) the regions of ambiguity to both sequences, resulting in a normalized, fully justified Allele (step 4, blue sequence).
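The trim, roll, and expand steps described in panel C can be sketched in Python. This is a minimal illustration under stated assumptions, not the normative VRS algorithm: the function name `fully_justify` and its signature are hypothetical, the reference is assumed to be a plain in-memory string, and `start`/`end` are inter-residue coordinates, which map directly onto Python slice indices. The toy sequence is invented for the example and is not the one drawn in the figure.

```python
def fully_justify(seq: str, start: int, end: int, alt: str) -> tuple[int, int, str]:
    """Return (start, end, alt) expanded over the whole region of ambiguity.

    Hypothetical helper: `seq` is the reference sequence, `start`/`end` are
    inter-residue coordinates, and `alt` replaces seq[start:end].
    """
    ref = seq[start:end]

    # Step 1: trim the common suffix, then the common prefix.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt, end = ref[:-1], alt[:-1], end - 1
    while ref and alt and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1

    # Step 2: only pure insertions or deletions can be shifted.
    # Substitutions (both non-empty) and no-ops (both empty) are returned as trimmed.
    if bool(ref) == bool(alt):
        return start, end, alt

    unit = ref or alt  # the inserted or deleted sequence

    # Step 3a: roll left while the preceding reference base matches the
    # (circularly rotated) end of the unit.
    left, i = start, 0
    while left > 0 and seq[left - 1] == unit[-1 - (i % len(unit))]:
        left, i = left - 1, i + 1

    # Step 3b: roll right while the following reference base matches the
    # (circularly rotated) start of the unit.
    right, j = end, 0
    while right < len(seq) and seq[right] == unit[j % len(unit)]:
        right, j = right + 1, j + 1

    # Step 4: prepend and append the ambiguous flanks to the alternate sequence;
    # the reference span becomes seq[left:right].
    return left, right, seq[left:start] + alt + seq[end:right]


reference = "TCAGCAGCT"
# Three different descriptions of the same three-base insertion:
print(fully_justify(reference, 3, 3, "GCA"))   # shifted to the right of the first repeat
print(fully_justify(reference, 1, 1, "CAG"))   # leftmost placement
print(fully_justify(reference, 0, 1, "TCAG"))  # VCF-style, with an anchor base
```

All three calls return the same span and alternate sequence, `(1, 8, "CAGCAGCAGC")`, which is the point of full justification: the result no longer depends on where the caller chose to place the event within the repeat.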

**Figure 4**
Computed identifiers. VRS provides a mechanism for federated variation identification via the Computed Identifier Algorithm. (A) The Computed Identifier Algorithm is defined in three stages. First, an identifiable VRS object such as an Allele (blue box) is transformed into a well-defined and canonical serialized JSON representation. The serialized Binary Large Object (BLOB) is then digested via the SHA-512 algorithm, truncated to retain only the first 24 bytes, and subsequently encoded using base64url. The resulting digest string (green text) is then appended to the object type identifier; for an Allele object, the identifier prefix is “VA” (blue text). The identifier is then assembled into a compact URI (CURIE) under the ga4gh namespace (orange text). (B) Use of the VRS framework enables de-duplication of identical variation concepts with differing HGVS descriptions. Here, multiple synonymous HGVS descriptions are indicated for a variant on genome builds GRCh37 and GRCh38, the corresponding transcript variant, and predicted protein translation. These four contexts (two genome assemblies, transcript, and protein) resolve to four distinct identifiers, regardless of which synonymous description is used to build the VRS object. Ellipses (“...”) used in objects and strings in this diagram represent content that is omitted for simplicity of presentation. The VRS-Python implementation provides full support for all operations depicted here, including translating between HGVS and VRS formats. For additional details, see https://vrs.ga4gh.org/en/1.2/impl-guide/computed_identifiers.html.
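The digest-and-assemble stages in panel A can be sketched with the Python standard library. This is an illustrative sketch rather than the VRS-Python implementation: the `serialize` function is a simplified stand-in (sorted keys, no whitespace) for the canonical serialization defined by the specification, which has further rules not reproduced here, and the `toy_allele` dictionary is a hypothetical placeholder, not a valid VRS Allele. The `.` between type prefix and digest follows the identifier format used in the VRS documentation.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to the first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    # 24 bytes encode to exactly 32 base64url characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def serialize(obj: dict) -> bytes:
    """Simplified stand-in for canonical serialization (assumption, see lead-in)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def computed_identifier(obj: dict, type_prefix: str) -> str:
    """Assemble a CURIE in the ga4gh namespace from a type prefix and digest."""
    return f"ga4gh:{type_prefix}.{sha512t24u(serialize(obj))}"


# Hypothetical placeholder object; field names are for illustration only.
toy_allele = {"type": "Allele", "location": "...", "state": "..."}
print(computed_identifier(toy_allele, "VA"))
```

Because the identifier is derived only from the serialized content, two parties that build the same object independently arrive at the same string without consulting a central registry, which is what makes the identification federated.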

Cited by

vcfdist: accurately benchmarking phased small variant calls in human genomes.
Dunn T, Narayanasamy S. Nat Commun. 2023 Dec 9;14(1):8149. doi: 10.1038/s41467-023-43876-x. PMID: 38071244.
SARS-CoV-2 genomic contextual data harmonization: recommendations from a mixed methods analysis of COVID-19 case report forms across Canada.
Cameron R, Savić Kallesøe S, Griffiths EJ, Dooley D, Sridhar A, Sehar A, Tindale LC, Hsiao WWL. Arch Public Health. 2025 Apr 30;83(1):117. doi: 10.1186/s13690-025-01604-5. PMID: 40307880.
The European Variation Archive: a FAIR resource of genomic variation for all species.
Cezard T, Cunningham F, Hunt SE, Koylass B, Kumar N, Saunders G, Shen A, Silva AF, Tsukanov K, Venkataraman S, Flicek P, Parkinson H, Keane TM. Nucleic Acids Res. 2022 Jan 7;50(D1):D1216-D1220. doi: 10.1093/nar/gkab960. PMID: 34718739.
Guidelines for releasing a variant effect predictor.
Livesey BJ, Badonyi M, Dias M, Frazer J, Kumar S, Lindorff-Larsen K, McCandlish DM, Orenbuch R, Shearer CA, Muffley L, Foreman J, Glazer AM, Lehner B, Marks DS, Roth FP, Rubin AF, Starita LM, Marsh JA. Genome Biol. 2025 Apr 15;26(1):97. doi: 10.1186/s13059-025-03572-z. PMID: 40234898.
Candidate targets of copy number deletion events across 17 cancer types.
Huang Q, Baudis M. Front Genet. 2023 Jan 16;13:1017657. doi: 10.3389/fgene.2022.1017657. eCollection 2022. PMID: 36726722.
