Am J Hum Genet. 2019 Nov 7;105(5):933-946.
doi: 10.1016/j.ajhg.2019.09.015. Epub 2019 Oct 10.

Finding Diagnostically Useful Patterns in Quantitative Phenotypic Data

Stuart Aitken et al. Am J Hum Genet.

Abstract

Trio-based whole-exome sequence (WES) data have established confident genetic diagnoses in ∼40% of previously undiagnosed individuals recruited to the Deciphering Developmental Disorders (DDD) study. Here we aim to use the breadth of phenotypic information recorded in DDD to augment diagnosis and disease variant discovery in probands. Median Euclidean distances (mEuD) were employed as a simple measure of similarity of quantitative phenotypic data within sets of ≥10 individuals with plausibly causative de novo mutations (DNM) in 28 different developmental disorder genes. 13/28 (46.4%) showed significant similarity for growth or developmental milestone metrics, 10/28 (35.7%) showed similarity in HPO term usage, and 12/28 (43%) showed no phenotypic similarity. Pairwise comparisons of individuals with high-impact inherited variants to the 32 individuals with causative DNM in ANKRD11 using only growth z-scores highlighted five likely causative inherited variants and two unrecognized DNM, resulting in an 18% diagnostic uplift for this gene. Using an independent approach, naive Bayes classification of growth and developmental data produced reasonably discriminative models for the 24 DNM genes with sufficiently complete data. An unsupervised naive Bayes classification of 6,993 probands with WES data and sufficient phenotypic information defined 23 in silico syndromes (ISSs) and was used to test a "phenotype first" approach to the discovery of causative genotypes using WES variants strictly filtered on allele frequency, mutation consequence, and evidence of constraint in humans. This highlighted heterozygous de novo nonsynonymous variants in SPTBN2 as causative in three DDD probands.

Keywords: developmental disease; genotype; naive Bayes; phenotype; tSNE.
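
The mEuD measure described in the abstract can be illustrated with a minimal sketch. It assumes each proband is a fixed-length vector of quantitative values (for example growth z-scores) and judges similarity by comparing a gene set's mEuD with that of random same-size sets of probands, as the Figure 2A caption describes. The function names, the toy data, and the add-one empirical p value are illustrative assumptions, not the authors' implementation.

```python
import itertools
import math
import random
import statistics


def median_euclidean_distance(rows):
    """Median of all pairwise Euclidean distances between equal-length vectors."""
    dists = [math.dist(a, b) for a, b in itertools.combinations(rows, 2)]
    return statistics.median(dists)


def empirical_p_value(cohort, gene_indices, n_random=1000, seed=0):
    """Compare a gene set's mEuD with random same-size sets from the cohort.

    Returns the observed mEuD and the fraction of random sets that are at
    least as tight (assumed add-one correction, so p is never exactly 0).
    """
    rng = random.Random(seed)
    observed = median_euclidean_distance([cohort[i] for i in gene_indices])
    k = len(gene_indices)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(cohort, k)
        if median_euclidean_distance(sample) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_random + 1)


# Toy data (hypothetical): 200 probands with three z-scores each.
_rng = random.Random(1)
cohort = [tuple(_rng.gauss(0, 1) for _ in range(3)) for _ in range(200)]
# Pretend the first 10 probands share a tight, shifted growth profile.
for i in range(10):
    cohort[i] = (-2.0 + _rng.gauss(0, 0.1),
                 -2.0 + _rng.gauss(0, 0.1),
                 -1.5 + _rng.gauss(0, 0.1))

observed, p_value = empirical_p_value(cohort, list(range(10)))
print(f"mEuD = {observed:.3f}, empirical p = {p_value:.4f}")
```

A small mEuD relative to the random-set distribution suggests that the individuals in the set resemble one another more than arbitrary probands do; the abstract's threshold of at least 10 individuals per gene keeps the median reasonably stable.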

Conflict of interest statement

M.E.H. is a co-founder, consultant, and non-executive director of Congenica Ltd. The remaining authors declare no competing interests.

Figures

Figure 1
Summary of the Phenotypic Data from DDD Employed in This Study. (A) Description of categorical data types used in the analyses described in the Results. (B) Description of quantitative data described in the Results. (C) Overview of the type and purpose of the analyses described in the Results. 6,993 of the first 7,833 probands from the DDD 8K trio exome data freeze had sufficient phenotypic data available to be used for the median Euclidean distance analysis and the naive Bayes classification approaches. The results of these analyses were gene models and in silico syndromes that were then used for analysis of strictly filtered inherited variants and a phenotype first approach to gene enrichment for the purpose of novel locus and/or mechanism discovery.
Figure 2
Phenotype-Based Categorization of Individuals with Likely Causative De Novo Mutations in Confirmed Developmental Disorder Genes. (A) Histograms showing the distribution of median distances of random sets of DDD probands for growth (purple), similarity of Human Phenotype Ontology term attributions (brown), and developmental milestone metrics (turquoise). In the upper panel the striking similarities observed in median distances within the group of individuals with de novo mutations (DNM) in ANKRD11 are indicated by the red lines. In the lower panel the median distances for the individuals with DNM in DYNC1H1 are indicated by the red lines, which show no obvious similarity within this group. (B) Histogram showing the distribution of pairwise Euclidean distances for growth metrics for the individuals with ANKRD11 DNM (purple). The red arrows represent the median of the pairwise comparisons of the individuals with high-impact inherited variants in ANKRD11 with the DNM individuals. The green line represents the mEuD of all DDD probands against the ANKRD11 DNM case subjects. (C) Boxplot showing the distribution of pairwise distances of individuals with inherited variants and DNM in ANKRD11, ARID1B, and KMT2A (dark purple). For comparison the distribution of distances between the individuals with DNM and all other DDD probands is shown (light purple). (D) The naive Bayes model for each of the 24 DNM genes with sufficient data is summarized by the discretized values in ten phenotypic categories. Cell shading indicates the discretized value where the value has a probability >0.5 (0.6 for binary variables). A key is provided describing the discrete groupings. These models were based on the observed phenotypes for each gene in isolation but generated apparently discriminative patterns. (E) To explore the diagnostic potential of the 24 gene models shown in (D), a confusion matrix was created showing the assignments based on each gene model using only phenotypic data from all individuals with diagnostic DNM assignments (columns). The diagonal represents the concordance of the phenotypic and genetic assignment.
Figure 3
Phenotypic Prototypes and Predictions from Naive Bayes Models. Unsupervised naive Bayes clustering of the 6,993 DDD probands into 23 distinct classes, here termed in silico syndromes (ISSBayes). (A) A graphical representation of the phenotypic characteristics that define each ISSBayes using 10 discretized phenotypic values; a key is provided for each of the color-coded groups. (B) Scatterplots show the projection into two dimensions by t-SNE of growth for each ISSBayes where symbols are color coded by ISS. (C) To determine whether the ISSBayes showed any agreement with DNM in 24 different genes, we created a confusion matrix, which did not indicate strong evidence of concordance of the phenotypic and genetic assignments. (D) We also defined eight sets of HPO terms that describe site-specific malformations and looked for over-representation of probands when categorized by profile (Fisher’s exact test). Three malformation types were enriched in nine different profiles (p value adjusted for testing 23 profiles, adjusted p ≤ 0.05 considered significant).
Figure 4
Discovery of Candidate Diagnostic Genes by Phenotypic Profile. (A) Heatmap of ISSBayes 1:23 tested for over-representation of genes passing the variant filtering in the phenotypic profiles (Fisher’s exact test, p value adjusted for testing 23 profiles, adjusted p ≤ 0.05 considered significant, 359 genes had at least 8 probands; mean 1.36 SNV per proband). The variants derived from proband whole-exome sequencing in the 8K data freeze were filtered by MAF, consequence, pLI, CADD, and NSV scores to produce a set of 12,458 plausible diagnostic SNVs, mean 2.12 per proband in a set of 6,993 probands. Gene names in black are known developmental genes in the G2P database; those in blue are not in the G2P database. (B) A Manhattan plot shows the p values of enriched genes. (C) Pathogenic mutations at the CH1:CH2 interdomain interface of SPTBN2. The site of the novel DDD mutation identified here is shown in red, while the sites of the previously identified pathogenic mutations are shown in orange. The crystal structure of alpha-actinin (PDB: 4D1E) was used to build a homology model of SPTBN2 using SWISS-MODEL. The cryoelectron microscopy structure of the SPTBN2 CH1 domain (PDB: 6ANU) was very similar to the model (RMSD = 1.5 Å). (D) A cartoon of SPTBN2 protein structure. The distribution of pathogenic and likely pathogenic variants recorded in ClinVar is indicated by the yellow (missense) and red (nonsense and frameshift) triangles above the protein. The color of the variant text indicates the age of onset of the ataxia as defined by the key. The dashed red boxes indicate the position of the de novo variants identified within the DDD cohort.
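
The over-representation tests named in the Figure 3 and Figure 4 captions can be sketched as a one-sided Fisher's exact test per phenotypic profile, followed by a multiple-testing adjustment for 23 profiles. The captions do not state the sidedness of the test or the adjustment method, so the upper-tail test and the Bonferroni correction below are assumptions, and the function names and counts are hypothetical.

```python
from math import comb


def fisher_over_representation(in_both, in_profile, with_gene, total):
    """One-sided Fisher's exact p value for over-representation.

    Probability of seeing at least `in_both` gene carriers in a profile of
    size `in_profile`, when `with_gene` of `total` probands carry a
    filtered variant in the gene (hypergeometric upper tail).
    """
    upper = min(in_profile, with_gene)
    tail = sum(
        comb(with_gene, x) * comb(total - with_gene, in_profile - x)
        for x in range(in_both, upper + 1)
    )
    return tail / comb(total, in_profile)


def bonferroni(p, n_tests=23):
    """Assumed adjustment: multiply by the number of profiles, cap at 1."""
    return min(1.0, p * n_tests)


# Hypothetical counts: 8 probands carry a filtered variant in one gene,
# 5 of them fall in a profile containing 300 of the 6,993 probands.
p_raw = fisher_over_representation(in_both=5, in_profile=300,
                                   with_gene=8, total=6993)
p_adj = bonferroni(p_raw)
print(f"raw p = {p_raw:.3g}, adjusted p = {p_adj:.3g}")
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow in the tail sum; a statistics library would normally be used instead, but the standard-library version keeps the sketch self-contained.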
