Nat Commun. 2023 Feb 17;14(1):919.
doi: 10.1038/s41467-023-36634-6.

Hypothesis-free phenotype prediction within a genetics-first framework

Chang Lu et al. Nat Commun. 2023.

Abstract

Cohort-wide sequencing studies have revealed that the largest category of variants is those deemed 'rare', even for the subset located in coding regions (99% of known coding variants are seen in less than 1% of the population). Associative methods give some understanding of how rare genetic variants influence disease and organism-level phenotypes. But here we show that additional discoveries can be made through a knowledge-based approach using protein domains and ontologies (function and phenotype) that considers all coding variants regardless of allele frequency. We describe an ab initio, genetics-first method making molecular knowledge-based interpretations for exome-wide non-synonymous variants for phenotypes at the organism and cellular level. By using this reverse approach, we identify plausible genetic causes for developmental disorders that have eluded other established methods and present molecular hypotheses for the causal genetics of 40 phenotypes generated from a direct-to-consumer genotype cohort. This system offers a chance to extract further discovery from genetic data after standard tools have been applied.

Conflict of interest statement

Statement: B.G.T. is a co-founder of the OpenSNP database (non-financial competing interest). Patent application (Jan 2016): application number WO2017125778A1; inventors J.G., J.Z. and N.T.; author filed; status is granted in Japan and under examination in other countries; relevant to the method of spectral clustering applied to phenotype prediction. The remaining authors declare no competing interests.

Figures

Fig. 1. Framework incentive and design.
a Positioning relative to heritability interpretation from two prevailing genetic association analyses of many. G2P: Gene-to-phenotype databases, GWA: Genome-wide association, PRS: Polygenic risk scores. The colour bar shows the genetic unit of analysis employed by each method. b Framework overview. The method takes an individual’s genetic data as input and produces a list of ontology terms for which the person is a potential outlier. It uses a large background of genomes to which the individual is compared, ontology databases with gene-phenotype relationships, and evolutionary intolerance of mutations in protein domain families encoded by hidden Markov models (HMMs). c Schematic illustration of the genetic landscape for an ontology term (HP:0000834 ‘abnormality of the adrenal glands’) highlighting genomes with high outlier scores. Each node represents a genome, and edges are proportional to genetic distance in eigenspace (in essence, a reduced-dimensional feature space between genomes).
Fig. 2. An outline of the genetics-first analysis framework.
See methods for detail. Genome data are input at the top and causal hypotheses are output at the bottom. In the orange top box (algorithm), the functional distance between each missense variant is first derived from domain-based HMM probabilities, scaled depending on zygosity (top row). Subsequently (second row), variants falling in the region of a gene with homology to an HMM representing a functional unit (domain) are collated into a genetic profile for a phenotype using domain-phenotype mappings inferred using dcGO. This multi-domain collapsing of an ontology term can be likened to gene-based collapsing used in PheWAS. Next (bottom row of the orange box), the profile of combined functional distances (from the top row) is used to calculate a genetic distance to every genome in the background. Spectral clustering of the distance matrix identifies which genomes are outliers under the profile (HP:0000834 in this illustration); nodes represent genomes and are coloured by outlier score (bottom right of the orange box). In the next (blue) box, only the top-scoring outlier phenotypes are passed to the confirmation stage, where some of these genetics-first predictions are identified as correct, giving a likely cause of the verified phenotype that was predicted.
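The step from a genome-by-genome distance matrix to outlier scores can be illustrated with a toy sketch. This is not the authors' implementation: the Gaussian kernel, the `sigma` and `k` parameters, the function name `spectral_outlier_scores`, the distance-from-centroid score and the invented distance matrix are all assumptions made for illustration, and NumPy is assumed to be installed.

```python
import numpy as np


def spectral_outlier_scores(dist, sigma=1.0, k=2):
    """Score each genome by how far it sits from the centroid in eigenspace.

    dist  : symmetric (n, n) matrix of pairwise genetic distances (hypothetical).
    sigma : width of the Gaussian kernel (an assumption, not from the source).
    k     : number of leading eigenvectors kept (an assumption).
    """
    dist = np.asarray(dist, dtype=float)
    # Turn distances into affinities; close genomes get affinity near 1.
    affinity = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)
    # Symmetric normalisation D^{-1/2} A D^{-1/2}, as in normalised spectral clustering.
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    normalised = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order, so the last k columns lead.
    _, eigvecs = np.linalg.eigh(normalised)
    embedding = eigvecs[:, -k:]
    # Illustrative outlier score: Euclidean distance from the embedding centroid.
    return np.linalg.norm(embedding - embedding.mean(axis=0), axis=1)


# Invented example: five similar genomes and one distant genome (index 5).
positions = np.array([0.0, 0.1, 0.2, 0.1, 0.15, 3.0])
toy_dist = np.abs(positions[:, None] - positions[None, :])
scores = spectral_outlier_scores(toy_dist)
print(int(np.argmax(scores)))  # index of the genome with the highest score
```

In this toy input the distant genome receives the highest score because it has almost no affinity to the others and therefore dominates one of the leading eigenvectors.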
Fig. 3. Evaluation of performance on DTC (a–d) and DDD (e) cohorts.
a Participants upload their DTC genome data on which outlier phenotypes are predicted, then shuffled with outliers from a decoy genome randomly selected from the background, to create a uniquely personalised questionnaire. Answers are used to confirm true predictions against decoys. b A test of 100,000 random permutations of the dataset shows that observed scores are on average higher for confirmed phenotypes, with a p-value of 8.25e-8 against randomly permuted scores. c The rate of identifying confirmed phenotypes by score threshold (blue) and number of above-threshold predictions (green); at the default threshold of 0.022 the rate is more than double the rate for decoys, with a p-value of 7.93e-7 against 100,000 permutations. d The significance (green) of the top phenotypes by within-phenotype permutation of answers 100,000 times, and (blue) for the top x phenotypes, the number left after subtracting from the total those expected by chance. z-scores were derived from testing the null hypothesis that similar results can be obtained if scores are assigned randomly (see methods). The p-value is calculated from the z-score in a right-tailed hypothesis test. e For DDD patients, the 60 above-threshold predictions confirmed by clinical annotation with a p-value of 5.12e-4 versus data from 100,000 random permutations, using the same hypothesis test procedure as in d. Inset: the 50 patients with top predictions compared to published data for whether a genetic diagnosis has been identified through DDG2P, and split by presence of de novo mutation (DNM).
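The permutation logic in this caption (random reassignment of scores, a z-score against the permuted distribution, then a right-tailed p-value) can be sketched as follows. The exact test statistic is given in the paper's methods and is not reproduced here; using the mean score of confirmed phenotypes as the statistic, the function name `permutation_z_test` and the toy data are assumptions.

```python
import math
import random
import statistics


def permutation_z_test(scores, confirmed, n_perm=10_000, seed=0):
    """Right-tailed permutation test on the mean score of confirmed items.

    scores    : list of outlier scores (hypothetical values).
    confirmed : list of booleans, True where the prediction was confirmed.
    """
    rng = random.Random(seed)
    n_confirmed = sum(confirmed)
    observed = statistics.mean(s for s, c in zip(scores, confirmed) if c)

    # Null distribution: assign scores to the confirmed labels at random,
    # which is the same as drawing n_confirmed scores without replacement.
    null = [statistics.mean(rng.sample(scores, n_confirmed))
            for _ in range(n_perm)]

    z = (observed - statistics.mean(null)) / statistics.stdev(null)
    # Right-tailed p-value from the standard normal: P(Z >= z).
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p


# Invented example: the three confirmed predictions carry the highest scores.
toy_scores = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15, 0.12, 0.3, 0.25, 0.05]
toy_confirmed = [True, True, True] + [False] * 7
z, p = permutation_z_test(toy_scores, toy_confirmed, n_perm=2000)
print(f"z = {z:.2f}, right-tailed p = {p:.3g}")
```

Converting the z-score through the normal tail, rather than counting permutations that exceed the observed value, allows p-values smaller than 1/`n_perm`, at the cost of assuming the permuted statistic is approximately normal.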
Fig. 4. Experimental test for GO:0010826.
This refers to 'any process that decreases the frequency, rate or extent of centrosome duplication'. a Representative examples of each of the indicated cell lines. Centrioles were detected by staining with gamma-tubulin (γ-tub); nuclei were stained with DAPI. Asterisks indicate cells with more than two centrioles. Hoik-1, Sehp-2 and Kegd-2 were control cell lines. b Histogram showing the percentage of cells with more than 2 centrioles per cell in the indicated cell lines. Results are summarised as the mean ± s.e.m. from 3 independent experiments (600-800 cells per cell line were analysed; the percentage from each experiment is shown as a dot; *: P < 0.05; **: P < 0.005; ns: not significant). Specifically, the p-values are: Boqx-2 P = 0.045, Suul-1 P = 0.0021, Yoch-6 P = 0.0023 (one-sided t-test).
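As a reminder of the arithmetic behind "mean ± s.e.m." and a t statistic from three replicate experiments, here is a minimal sketch. The percentages are invented, and the Welch (unequal variance) form is an assumption, since the caption states only that a one-sided t-test was used.

```python
import math
import statistics


def mean_sem(values):
    """Return (mean, standard error of the mean) for replicate measurements."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), sem


def welch_t(test, control):
    """Welch t statistic for 'test mean greater than control mean'."""
    m_test, sem_test = mean_sem(test)
    m_ctrl, sem_ctrl = mean_sem(control)
    # The squared s.e.m. of each group equals its variance divided by n.
    return (m_test - m_ctrl) / math.sqrt(sem_test ** 2 + sem_ctrl ** 2)


# Hypothetical percentages of cells with more than 2 centrioles (3 experiments each).
test_line = [10.0, 12.0, 14.0]
control_line = [2.0, 4.0, 6.0]
print(mean_sem(test_line))
print(welch_t(test_line, control_line))
```

Turning the statistic into a one-sided p-value requires the t distribution, which the standard library does not provide; one option, if SciPy is available, is `scipy.stats.ttest_ind(test_line, control_line, equal_var=False, alternative="greater")`.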
Fig. 5. Types of genetic outlier.
a Outliers classified by underlying genetic variants into 4 types: 1-a, single variant only required; 1-b, single variant plus contributing variant(s); 2-a, multiple variants but dominated by one high-scoring variant; or 2-b, multiple variants required. b Distribution of the four types of outliers in the DDD cohort. c Violin plots showing the distribution of log-scaled minor allele frequencies (MAF) of variants involved in all predicted outliers, and by type of outlier in the DDD cohort. Colours show different types of outliers; schemes show roles of the variants involved, as in a. Specifically, all (grey): 3682 variants involved in at least one outlier prediction, 539 variants involved in type 1-a outlier (dark blue), 601 required (star) and 1548 contributing (triangle) in type 1-b (light blue), 416 required (star) and 20 contributing (triangle) in type 2-a (orange), and 1767 in type 2-b (red). Distributions were generated using a kernel density estimate in the seaborn package. Boxes show quartiles and whiskers represent 1.5 times the interquartile range. d Percentage of outliers with a combinatorial component to the score with variants contributing non-independently.
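The box and whisker convention in panel c (quartiles, with whiskers at 1.5 times the interquartile range) can be made concrete with a short sketch. The function name `box_stats` and the allele frequencies are hypothetical; the whiskers are drawn to the most extreme data point inside the 1.5 × IQR fences, which is the usual plotting convention and is assumed here rather than taken from the source.

```python
import math
import statistics


def box_stats(values):
    """Quartiles and whisker ends for a box plot of `values`."""
    # "inclusive" uses linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the furthest observation still inside the fences.
        "lower_whisker": min(v for v in values if v >= low_fence),
        "upper_whisker": max(v for v in values if v <= high_fence),
    }


# Hypothetical minor allele frequencies, log10-scaled as in the caption.
toy_maf = [0.00002, 0.0001, 0.0004, 0.001, 0.003, 0.01, 0.05, 0.2]
log_maf = [math.log10(m) for m in toy_maf]
print(box_stats(log_maf))
```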
Fig. 6. Examples.
Any coordinates shown are relative to genome assembly GRCh37. a Novel gene association. ADAM7 [https://www.ncbi.nlm.nih.gov/gene/8756], ADAMTS13 [https://www.ncbi.nlm.nih.gov/gene/11093]. b Known variant. Chr6 Pos26093141, HFE-C282Y. c Novel variant in related gene. Chr12 Pos52913668, KRT5-G138E. d Single variant. Chr17 Pos29586054, NF1-L1425R. e Single variant. Chr19 Pos17927755, INSL3-R102C. f Novel variant in known gene. Chr3 Pos18143037, SOX2-L75P. g Combinatorial effect. 2 variants in CYP4B1-R375C/R340C and CYP2A7-T347A and CYP2D6-R245C. h Experimentally validated on HipSci. Chr3 Pos48414274, FBXW12-P6L. Panels (a) and (f) were created with BioRender.com.
