Adv Genet (Hoboken). 2022 Aug 25;4(1):2200016.
doi: 10.1002/ggn2.202200016. eCollection 2023 Mar.

GA4GH Phenopackets: A Practical Introduction

Markus S Ladewig et al. Adv Genet (Hoboken).

Abstract

The Global Alliance for Genomics and Health (GA4GH) is developing a suite of coordinated standards for genomics for healthcare. The Phenopacket is a new GA4GH standard for sharing disease and phenotype information that characterizes an individual person, linking that individual to detailed phenotypic descriptions, genetic information, diagnoses, and treatments. A detailed example is presented that illustrates how to use the schema to represent the clinical course of a patient with retinoblastoma, including demographic information, the clinical diagnosis, phenotypic features and clinical measurements, an examination of the extirpated tumor, therapies, and the results of genomic analysis. The Phenopacket Schema, together with other GA4GH data and technical standards, will enable data exchange and provide a foundation for the computational analysis of disease and phenotype information to improve our ability to diagnose and conduct research on all types of disorders, including cancer and rare diseases.

Keywords: FAIR data; Global Alliance for Genomics and Health; Human Phenotype Ontology; Phenopacket Schema; deep phenotyping.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Phenopackets Protobuf message comparison. (a) Definition of OntologyClass, a data type in the Phenopacket Schema, in protobuf. (b) Representation of an instance of OntologyClass (representing the HPO term for neutropenia) in a YAML format. (c) Equivalent representation in JSON format.
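
To make panels (b) and (c) concrete, the following Python sketch builds an OntologyClass instance as a plain dictionary and serializes it to JSON. It assumes the two-field layout (id, label) of OntologyClass and uses HP:0001875 as the HPO identifier for neutropenia. The helper name `ontology_class` is hypothetical and is not part of any Phenopackets library.

```python
import json


def ontology_class(term_id: str, label: str) -> dict:
    """Return a dict shaped like an OntologyClass message (id plus label).

    Hypothetical helper for illustration only; it is not a library API.
    """
    # Ontology identifiers are written as CURIEs, i.e. "PREFIX:LOCAL_ID".
    if ":" not in term_id:
        raise ValueError(f"expected a CURIE such as 'HP:0001875', got {term_id!r}")
    return {"id": term_id, "label": label}


# HPO term for neutropenia, as in the figure.
neutropenia = ontology_class("HP:0001875", "Neutropenia")

# JSON rendering of the same instance (compare panel c).
neutropenia_json = json.dumps(neutropenia, indent=2)
print(neutropenia_json)
```

The same dictionary could be written out as YAML with a YAML library; JSON is used here only because it is available in the Python standard library.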
Figure 2
Phenopacket Schema overview. The GA4GH Phenopacket Schema is a hierarchical structure that consists of two required fields, id and MetaData, as well as eight optional fields, Individual, Disease, Interpretation, Biosample, PhenotypicFeature, Measurement, MedicalAction, and files, each of which is discussed in this article. A detailed version of the schema, including elements from VRS/VRSATILE, is shown in ref. [2].
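
As a rough illustration of the top-level structure, the sketch below checks a phenopacket that has been loaded into a Python dictionary for the two required fields and flags unknown keys. The camelCase key names are assumed to follow the JSON serialization of version 2 of the schema and should be verified against the schema documentation. The function name `check_top_level` and the example values are hypothetical.

```python
# Assumed JSON key names for the top-level Phenopacket fields (version 2).
REQUIRED_FIELDS = ("id", "metaData")
OPTIONAL_FIELDS = (
    "subject",             # Individual
    "phenotypicFeatures",  # PhenotypicFeature
    "measurements",        # Measurement
    "biosamples",          # Biosample
    "interpretations",     # Interpretation
    "diseases",            # Disease
    "medicalActions",      # MedicalAction
    "files",               # File
)


def check_top_level(phenopacket: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found.

    This only inspects top-level keys. It is not a substitute for full
    validation against the schema.
    """
    problems = []
    for name in REQUIRED_FIELDS:
        # A missing key and an empty value are both treated as absent.
        if not phenopacket.get(name):
            problems.append(f"missing required field: {name}")
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    for name in phenopacket:
        if name not in known:
            problems.append(f"unexpected field: {name}")
    return problems


# Hypothetical minimal phenopacket with placeholder values.
minimal = {"id": "example-1", "metaData": {"phenopacketSchemaVersion": "2.0"}}
print(check_top_level(minimal))
```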
Figure 3
Phenopacket messages. This figure and Figures 4–12 show excerpts of the phenopacket with line numbers added. (a) Line 1 shows the beginning of the phenopacket (in a YAML file that only contains a single phenopacket, line 1 would contain a “–” instead of “phenopacket”). Line 2 contains the identifier of the phenopacket, which is required but whose syntax is arbitrary and should generally be specified by the application. Lines 3–9 show the Individual message. (b) A vitalStatus message for a different patient (unrelated to the retinoblastoma case) that indicates the time and cause of death as well as the survival time following the primary diagnosis.
Figure 4
List of PhenotypicFeatures. Clinodactyly, which is not known to be related to retinoblastoma and is presumably an incidental finding, was noted at the age of 3 months (P3M). Leukocoria was noted at the age of 4 months, strabismus at the age of 5 months and 15 days, and retinal detachment at the age of 6 months.
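
Ages of onset such as P3M are ISO 8601 durations. The sketch below parses the year, month, and day components of such a string and orders the four features from this caption by onset. It is a simplified parser written for illustration: it assumes normalized values (for example, fewer than 12 months) and does not handle weeks or time-of-day components. The names `parse_age` and `onsets` are hypothetical.

```python
import re

# Matches durations of the form PnYnMnD, where each component is optional.
_DURATION = re.compile(r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")


def parse_age(iso8601_duration: str) -> tuple:
    """Parse an age such as 'P5M15D' into a (years, months, days) tuple."""
    match = _DURATION.match(iso8601_duration)
    # A bare "P" matches the pattern but carries no components, so reject it.
    if match is None or iso8601_duration == "P":
        raise ValueError(f"unsupported ISO 8601 duration: {iso8601_duration!r}")
    years, months, days = (int(g) if g else 0 for g in match.groups())
    return years, months, days


# Onset ages taken from the caption text (5 months and 15 days is P5M15D).
onsets = {
    "Retinal detachment": "P6M",
    "Strabismus": "P5M15D",
    "Clinodactyly": "P3M",
    "Leukocoria": "P4M",
}

# Tuples compare element by element, so this sorts by years, months, then days.
by_onset = sorted(onsets, key=lambda label: parse_age(onsets[label]))
print(by_onset)
```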
Figure 5
Measurement. Measurements of intraocular pressure (IOP) in the left eye (lines 48–65) and right eye (lines 66–83).
Figure 6
Biosample. The id (line 85) is required and can be used to relate genomic interpretations to the biosample that corresponds to the interpretation (see Figure 8). Lines 86–88 represent the tissue of origin of the specimen, lines 89–95 represent phenotypic features of the specimen, and lines 96–108 represent a measurement taken of the maximal size of the tumor. Note that the same PhenotypicFeature and Measurement message definitions are used here as described above. The tumor progression field (lines 109–111) is used to specify whether a tumor is primary, metastatic, or recurrent. Lines 112–126 contain the pathological TNM (primary Tumor, lymph Nodes, distant Metastasis) assessment. Finally, lines 127–133 specify a File with results of whole genome sequencing performed on this tissue sample (see the section on File messages, below, for explanations). The interpretation based on this sequencing is presented in Figure 8.
Figure 7
Interpretation and GenomicInterpretation (1). The first portion of the Interpretation message is shown, with the status of the interpretation (solved) on line 136, the diagnosis to which the interpretation refers on lines 137–140, and the list of GenomicInterpretations starting on line 141. The first of two GenomicInterpretations is shown in this Figure, describing the mosaic deletion on chromosome 13. The deletion is specified as a CopyNumber message, which indicates the chromosome (the corresponding NCBI RefSeq NC_000013.14 identifier is listed), start and end positions, and the copy number (1, corresponding to a loss of one of the two copies of RB1). The extensions field is used to specify additional information about variants, such as the degree of mosaicism (here) or the allele frequency in the second GenomicInterpretation.
Figure 8
GenomicInterpretation (2). The GenomicInterpretation message uses the subjectOrBiosample field to indicate the source of the sequenced material. In this case, the source was the tumor specimen described in the Biosample message of Figure 6. The variant, whose id is given as the corresponding dbSNP accession number (rs121913300), is classified as pathogenic using ACMG criteria (ClinVar VCV000126824.9). Information about the genomic location of the variant is provided as an Allele message in lines 171–180. The corresponding HGVS expression is provided in the label field (line 181; in general, the label can contain arbitrary text). The GeneContext field indicates the affected gene with its HUGO Gene Nomenclature Committee (HGNC) identifier and gene symbol in lines 182–184. The corresponding variant call format (VCF) representation of the variant is shown in the VcfRecord message in lines 190–195. The zygosity of the variant is specified in the allelicState field.
Figure 9
Disease message. Disease message describing the diagnosis, the stage,[21] age of onset, and primary site of retinoblastoma in the patient.
Figure 10
A list of three MedicalAction messages. The Treatment message (lines 220–250) refers to the intraarterial administration of melphalan. A single dose was administered on the indicated date to treat retinoblastoma with curative intent. Vasospasm occurred as an adverse drug effect, which necessitated the termination of this treatment. The TherapeuticRegimen message (lines 251–267) refers to the administration of three chemotherapeutic drugs (carboplatin, etoposide, vincristine) according to standard protocols from the age of 7–8 months. The Procedure message (lines 268–283) describes the surgical removal of the affected eye.
Figure 11
File message. The URI field contains a Uniform Resource Identifier (URI), that is, a string for locating a file on the internet, on another network, or in a computer file system. The individualToFileIdentifiers field is a map from the identifier used in the phenopacket to those used in the VCF or another file. The fileAttributes field is a list of key-value pairs used to specify the genome assembly and the file format.
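
The three fields described in this caption can be sketched as a Python dictionary. The file path, sample identifiers, and the helper name `make_file_message` are all hypothetical, and the attribute keys `genomeAssembly` and `fileFormat` are assumed names that should be checked against the schema documentation.

```python
def make_file_message(uri, individual_to_file_ids, genome_assembly, file_format):
    """Return a dict shaped like a File message.

    Hypothetical helper for illustration only; it is not a library API.
    """
    return {
        # Location of the file on a network or local file system.
        "uri": uri,
        # Maps the identifier used in the phenopacket to the one used in the file.
        "individualToFileIdentifiers": dict(individual_to_file_ids),
        # Key-value pairs; the key names here are assumptions.
        "fileAttributes": {
            "genomeAssembly": genome_assembly,
            "fileFormat": file_format,
        },
    }


vcf_file = make_file_message(
    uri="file:///data/example/patient1.vcf.gz",         # hypothetical path
    individual_to_file_ids={"patient1": "SAMPLE_01"},   # hypothetical identifiers
    genome_assembly="GRCh38",
    file_format="VCF",
)
print(vcf_file)
```

The identifier map matters in practice because a VCF file can contain several samples, and the sample column name often differs from the identifier used for the individual in the phenopacket.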
Figure 12
Phenopackets Metadata. The MetaData message contains information about each ontology used to provide terms in the phenopacket as well as the phenopacket version. Version 1 of the GA4GH standard was released in 2019 to elicit feedback from the community. Version 2 was developed on the basis of this feedback and is described here.

References

    1. Rehm H. L., Page A. J. H., Smith L., Adams J. B., Alterovitz G., Babb L. J., Barkley M. P., Baudis M., Beauvais M. J. S., Beck T., Beckmann J. S., Beltran S., Bernick D., Bernier A., Bonfield J. K., Boughtwood T. F., Bourque G., Bowers S. R., Brookes A. J., Brudno M., Brush M. H., Bujold D., Burdett T., Buske O. J., Cabili M. N., Cameron D. L., Carroll R. J., Casas‐Silva E., Chakravarty D., Chaudhari B. P., et al., Cell Genom. 2021, 1, 100029.
    2. Jacobsen J. O. B., Baudis M., Baynam G. S., Beckmann J. S., Beltran S., Buske O. J., Callahan T. J., Chute C. G., Courtot M., Danis D., Elemento O., Essenwanger A., Freimuth R. R., Gargano M. A., Groza T., Hamosh A., Harris N. L., Kaliyaperumal R., Lloyd K. C. K., Khalifa A., Krawitz P. M., Köhler S., Laraway B. J., Lehväslaiho H., Matalonga L., McMurry J. A., Metke‐Jimenez A., Mungall C. J., Munoz‐Torres M. C., Ogishima S., et al., Nat. Biotechnol. 2022, 40, 817.
    3. Köhler S., Gargano M., Matentzoglu N., Carmody L. C., Lewis‐Smith D., Vasilevsky N. A., Danis D., Balagura G., Baynam G., Brower A. M., Callahan T. J., Chute C. G., Est J. L., Galer P. D., Ganesan S., Griese M., Haimel M., Pazmandi J., Hanauer M., Harris N. L., Hartnett M. J., Hastreiter M., Hauck F., He Y., Jeske T., Kearney H., Kindle G., Klein C., Knoflach K., Krause R., et al., Nucleic Acids Res. 2021, 49, D1207.
    4. de Coronado S., Wright L. W., Fragoso G., Haber M. W., Hahn‐Dantona E. A., Hartel F. W., Quan S. L., Safran T., Thomas N., Whiteman L., J. Biomed. Inform. 2009, 42, 530.
    5. Gargallo P., Oltra S., Balaguer J., Barranco H., Yáñez Y., Segura V., Juan‐Ribelles A., Calabria I., Llavador M., Castel V., Cañete A., Int. J. Retina Vitreous 2021, 7, 50.