This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Jun 6:2024.02.22.581566.

doi: 10.1101/2024.02.22.581566.

Predicting the direction of phenotypic difference

David Gokhman¹, Keith D Harris², Shai Carmi³, Gili Greenbaum²

Affiliations

¹ Department of Molecular Genetics, The Weizmann Institute of Science, Rehovot 76100, Israel.
² Department of Ecology, Evolution and Behavior, The Hebrew University of Jerusalem, Jerusalem 91904, Israel.
³ Braun School of Public Health and Community Medicine, The Hebrew University of Jerusalem, Jerusalem 9112102, Israel.

PMID: 38895291
PMCID: PMC11185551
DOI: 10.1101/2024.02.22.581566

David Gokhman et al. bioRxiv. 2024.

Update in

Predicting the direction of phenotypic difference.
Gokhman D, Harris KD, Carmi S, Greenbaum G. Nat Commun. 2025 Jul 26;16(1):6898. doi: 10.1038/s41467-025-62355-z. PMID: 40715205. Free PMC article.

Abstract

Predicting phenotypes from genomic data is a key goal in genetics, but for most complex phenotypes, predictions are hampered by incomplete genotype-to-phenotype mapping. Here, we describe a more attainable approach than quantitative predictions, which is aimed at qualitatively predicting phenotypic differences. Despite incomplete genotype-to-phenotype mapping, we show that it is relatively easy to determine which of two individuals has a greater phenotypic value. This question is central in many scenarios, e.g., comparing disease risk between individuals, the yield of crop strains, or the anatomy of extinct vs extant species. To evaluate prediction accuracy, i.e., the probability that the individual with the greater predicted phenotype indeed has a greater phenotypic value, we developed an estimator of the ratio between known and unknown effects on the phenotype. We evaluated prediction accuracy using human data from tens of thousands of individuals from either the same family or the same population, as well as data from different species. We found that, in many cases, even when only a small fraction of the loci affecting a phenotype is known, the individual with the greater phenotypic value can be identified with over 90% accuracy. Our approach also circumvents some of the limitations in transferring genetic association results across populations. Overall, we introduce an approach that enables accurate predictions of key information on phenotypes - the direction of phenotypic difference - and suggest that more phenotypic information can be extracted from genomic data than previously appreciated.

Figures

**Figure 1:**
Schematic of the approach to predict the direction of phenotypic difference. (a) We start with a phenotyped individual and an unphenotyped individual. We consider the known and unknown effects contributing to (or associated with) the phenotype of interest. Known genetic effects on the phenotypic difference are in blue (measured in units of the phenotype), unknown genetic and non-genetic effects are in yellow. Contributions that are identical between the two individuals (and therefore do not affect the phenotypic difference) are in gray. (b) Only the known divergent effects are used to predict the phenotypic difference between the individuals. The sum of the known effects can be thought of as the final position of a random walk with step sizes and directions corresponding to the effect sizes. (c) The sign of the sum of the known effects is used to predict the direction of phenotypic difference between the phenotyped and unphenotyped individuals. If the sum of the known effects between the individuals is positive, we predict that the phenotypic value of the unphenotyped individual is larger than that of the phenotyped individual (and the opposite prediction if the sum is negative). (d) Modeling prediction accuracy using random walks. The curves represent random walks where each step is an effect size. The blue curve shows the known effects of a specific random walk, and the sign (positive or negative) of the blue point at the end of the walk is the predicted direction of phenotypic difference. The yellow curves show potential random walks of the unknown effects (genetic and environmental). In this example, effect sizes were drawn from a standard normal distribution. For a correct prediction of the direction of the phenotypic difference, the sum of the known effects (blue point) and the true phenotypic difference (yellow point) need to be on the same side of the x-axis (both below or both above).
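The random-walk picture in panels (b)–(d) can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: it assumes all effects are independent draws from a standard normal distribution, that the known effects are a random subset, and the function name `direction_accuracy` and its parameters are hypothetical.

```python
import random

def direction_accuracy(n_effects=200, frac_known=0.5, n_trials=2000, seed=0):
    """Fraction of trials in which the sign of the known-effect sum
    matches the sign of the total (known + unknown) difference."""
    rng = random.Random(seed)
    n_known = round(n_effects * frac_known)
    correct = 0
    for _ in range(n_trials):
        # Assumption: every effect is an independent standard-normal step.
        effects = [rng.gauss(0.0, 1.0) for _ in range(n_effects)]
        known_sum = sum(effects[:n_known])   # end point of the "blue" walk
        total = sum(effects)                 # true phenotypic difference
        if (known_sum > 0) == (total > 0):
            correct += 1
    return correct / n_trials

if __name__ == "__main__":
    for frac in (0.1, 0.5, 0.9):
        print(f"fraction known = {frac}: accuracy ~ {direction_accuracy(frac_known=frac):.3f}")
```

Under these assumptions the accuracy averaged over all pairs is modest when few effects are known; the sketch does not condition on the size of the known sum, which is the role the known-to-unknown ratio plays in Figure 2.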

**Figure 2:**
Evaluating prediction accuracy using the known-to-unknown ratio ($κ$). (a) Simulated prediction accuracies for various $κ$ values (grouped into equally spaced bins), for different proportions of known vs. unknown effects (10%, 50%, and 90% of effects known). Effect sizes were drawn from a normal distribution. In gray is the theoretical expectation from Eq. 4. (b) The distribution of $κ$ values for the case where the known effects are randomly sampled. The vertical line denotes the $κ$ value required for a prediction accuracy of $P > 0.95$ ($κ = 0.62$). (c) The distribution of $κ$ values for the case where the known effects are those with the largest effect sizes. The vertical line denotes the $κ$ value required for a prediction accuracy of $P > 0.95$. In all panels, 10,000 effect sizes were drawn from a standard normal distribution to represent the known and unknown effects on the phenotype.
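The binning procedure of panel (a) can be sketched as follows. Eq. 4 is not reproduced on this page, so the sketch does not use it. Instead it assumes the unknown effects sum to a zero-mean normal variable with standard deviation `sd_unknown`, in which case the probability of a correct sign is the standard normal CDF evaluated at `ratio = |known sum| / sd_unknown`. This `ratio` is a hypothetical normalization chosen for illustration and need not coincide with the paper's $κ$.

```python
import random
from statistics import NormalDist

def binned_accuracy(n_known=50, n_unknown=50, n_trials=10000, bin_width=0.5, seed=1):
    """Return (bin_lower_edge, empirical_accuracy, predicted_accuracy, count) per ratio bin."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # SD of a sum of n_unknown independent standard-normal effects (assumption).
    sd_unknown = n_unknown ** 0.5
    bins = {}  # bin index -> [correct count, summed predicted accuracy, sample count]
    for _ in range(n_trials):
        known_sum = sum(rng.gauss(0.0, 1.0) for _ in range(n_known))
        unknown_sum = sum(rng.gauss(0.0, 1.0) for _ in range(n_unknown))
        ratio = abs(known_sum) / sd_unknown  # illustrative known-to-unknown ratio
        correct = (known_sum > 0) == (known_sum + unknown_sum > 0)
        entry = bins.setdefault(int(ratio / bin_width), [0, 0.0, 0])
        entry[0] += correct
        entry[1] += std_normal.cdf(ratio)  # predicted accuracy under the normal assumption
        entry[2] += 1
    return [(idx * bin_width, c / n, p / n, n) for idx, (c, p, n) in sorted(bins.items())]

if __name__ == "__main__":
    for lower, empirical, predicted, count in binned_accuracy():
        print(f"ratio >= {lower:.1f}: empirical {empirical:.3f}, predicted {predicted:.3f} (n={count})")
```

Comparing the empirical and predicted columns bin by bin mirrors the comparison between the simulated points and the gray curve in panel (a), under the stated assumptions only.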

**Figure 3:**
Predictions of the direction of phenotypic difference in humans. (a)–(c) The relationship between the known-to-unknown ratio ($κ$) and the proportion of correct predictions in different phenotypes. The theoretical expectation (Eq. 4) is shown in gray. (a) Pairwise comparisons of siblings from the UK Biobank for six phenotypes. (b) Pairwise comparisons of individuals from the European group (self-identified White British with Northwestern European genetic ancestry) from the UK Biobank for the same six phenotypes. (c) Pairwise height comparisons of individuals from the same population (either European, East Asian or African, as defined in Fig. S6), using GWAS generated from a European-ancestry group in Yengo *et al.* (15). (d)–(f) The distribution of $κ$ values for all pairwise comparisons. Each panel corresponds to the panel above it.

**Figure 4:**
The effect of directional selection on predicting the direction of phenotypic difference. (a) Prediction accuracy under directional selection, modeled as a biased random walk. The random walks in this schematic are biased toward the positive direction, with larger effects having a stronger bias. Biased random walks increase prediction accuracy. (b) Prediction accuracy for different $κ$ values and different levels of bias, with 50% randomly selected known effects out of 10,000 overall. (c) Prediction accuracy across species. Each point represents the proportion of correct predictions. The number of phenotypes is noted above each data point. For sticklebacks, between 14 and 27 phenotypic predictions were made for four different freshwater populations. For mice, predictions were made for two phenotypes in 16 developmental stages.
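The effect described in panels (a) and (b) can be illustrated with a biased random walk. This is a simplified sketch under stated assumptions: each effect has a standard-normal magnitude and is positive with probability `0.5 + bias`, the same for every effect. The caption describes a bias that grows with effect size, which this sketch does not model. The function name `biased_accuracy` and the `bias` parameter are hypothetical.

```python
import random

def biased_accuracy(bias, n_effects=200, frac_known=0.5, n_trials=2000, seed=2):
    """Sign-prediction accuracy when each effect is positive with probability 0.5 + bias."""
    rng = random.Random(seed)
    n_known = round(n_effects * frac_known)
    correct = 0
    for _trial in range(n_trials):
        effects = []
        for _step in range(n_effects):
            size = abs(rng.gauss(0.0, 1.0))  # effect magnitude
            # Assumption: uniform bias, independent of the effect's magnitude.
            sign = 1.0 if rng.random() < 0.5 + bias else -1.0
            effects.append(sign * size)
        known_sum = sum(effects[:n_known])
        total = sum(effects)
        if (known_sum > 0) == (total > 0):
            correct += 1
    return correct / n_trials

if __name__ == "__main__":
    for bias in (0.0, 0.05, 0.1):
        print(f"bias = {bias}: accuracy ~ {biased_accuracy(bias):.3f}")
```

With a positive bias, known and unknown effects tend to point the same way, so the sign of the known sum agrees with the sign of the total more often than in the unbiased case.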

References

1. Rosenberg N. A., Edge M. D., Pritchard J. K. & Feldman M. W. Interpreting polygenic scores, polygenic adaptation, and human phenotypic differences. Evolution, Medicine, and Public Health 2019, 26–34 (2019).
2. Young A. I., Benonisdottir S., Przeworski M. & Kong A. Deconstructing the sources of genotype-phenotype associations in humans. Science 365, 1396–1400 (2019).
3. Dittmar E. L., Oakley C. G., Conner J. K., Gould B. A. & Schemske D. W. Factors influencing the effect size distribution of adaptive substitutions. Proceedings of the Royal Society B: Biological Sciences 283, 20153065 (2016).
4. Orr H. A. The genetic theory of adaptation: a brief history. Nature Reviews Genetics 6, 119–127 (2005).
5. Scheben A. & Edwards D. Towards a more predictable plant breeding pipeline with CRISPR/Cas-induced allelic series to optimize quantitative and qualitative traits. Current Opinion in Plant Biology 45, 218–225 (2018).

Grants and funding

R01 HG011711/HG/NHGRI NIH HHS/United States
