PLoS One. 2022 Feb 24;17(2):e0264341.

doi: 10.1371/journal.pone.0264341. eCollection 2022.

Protein prediction for trait mapping in diverse populations

Ryan Schubert¹,²,³, Elyse Geoffroy³, Isabelle Gregga², Ashley J Mulford²,³, Francois Aguet⁴, Kristin Ardlie⁴, Robert Gerszten⁵, Clary Clish⁴, David Van Den Berg⁶, Kent D Taylor⁷, Peter Durda⁸, W Craig Johnson⁹, Elaine Cornell⁸, Xiuqing Guo⁷, Yongmei Liu¹⁰, Russell Tracy⁸, Matthew Conomos¹¹, Tom Blackwell¹², George Papanicolaou¹³, Tuuli Lappalainen¹⁴, Anna V Mikhaylova¹¹, Timothy A Thornton¹¹, Michael H Cho¹⁵, Christopher R Gignoux¹⁶, Leslie Lange¹⁶, Ethan Lange¹⁶, Stephen S Rich¹⁷, Jerome I Rotter⁷; NHLBI TOPMed Consortium; Ani Manichaikul¹⁷, Hae Kyung Im¹⁸, Heather E Wheeler²,³

Affiliations

¹ Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL, United States of America.
² Department of Biology, Loyola University Chicago, Chicago, IL, United States of America.
³ Program in Bioinformatics, Loyola University Chicago, Chicago, IL, United States of America.
⁴ Broad Institute, Cambridge, MA, United States of America.
⁵ Beth Israel Deaconess Medical Center, Boston, MA, United States of America.
⁶ University of Southern California, Los Angeles, CA, United States of America.
⁷ The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, United States of America.
⁸ Laboratory for Clinical Biochemistry Research, University of Vermont, Burlington, VT, United States of America.
⁹ Collaborative Health Studies Coordinating Center, University of Washington, Seattle, WA, United States of America.
¹⁰ Department of Medicine, Duke University School of Medicine, Durham, NC, United States of America.
¹¹ Department of Biostatistics, University of Washington, Seattle, WA, United States of America.
¹² Department of Biostatistics, University of Michigan, Ann Arbor, MI, United States of America.
¹³ Epidemiology Branch, National Heart, Lung and Blood Institute, Bethesda, MD, United States of America.
¹⁴ New York Genome Center and Department of Systems Biology, Columbia University, New York, NY, United States of America.
¹⁵ Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, United States of America.
¹⁶ Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America.
¹⁷ Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States of America.
¹⁸ Section of Genetic Medicine, The University of Chicago, Chicago, IL, United States of America.

PMID: 35202437
PMCID: PMC8870552
DOI: 10.1371/journal.pone.0264341

Ryan Schubert et al. PLoS One. 2022.

Abstract

Genetically regulated gene expression has helped elucidate the biological mechanisms underlying complex traits. Improved high-throughput technology allows similar interrogation of the genetically regulated proteome for understanding complex trait mechanisms. Here, we used the Trans-omics for Precision Medicine (TOPMed) Multi-omics pilot study, which comprises data from the Multi-Ethnic Study of Atherosclerosis (MESA), to optimize genetic predictors of the plasma proteome for genetically regulated proteome-wide association studies (PWAS) in diverse populations. We built predictive models for protein abundances using data collected in TOPMed MESA, for which we have measured 1,305 proteins by a SOMAscan assay. We compared predictive models built via elastic net regression to models integrating posterior inclusion probabilities estimated by fine-mapping SNPs prior to elastic net. In order to investigate the transferability of predictive models across ancestries, we built protein prediction models in all four of the TOPMed MESA populations, African American (n = 183), Chinese (n = 71), European (n = 416), and Hispanic/Latino (n = 301), as well as in all populations combined. As expected, fine-mapping produced more significant protein prediction models, especially in African ancestries populations, potentially increasing opportunity for discovery. When we tested our TOPMed MESA models in the independent European INTERVAL study, fine-mapping improved cross-ancestries prediction for some proteins. Using GWAS summary statistics from the Population Architecture using Genomics and Epidemiology (PAGE) study, which comprises ∼50,000 Hispanic/Latinos, African Americans, Asians, Native Hawaiians, and Native Americans, we applied S-PrediXcan to perform PWAS for 28 complex traits. The most protein-trait associations were discovered, colocalized, and replicated in large independent GWAS using proteome prediction model training populations with similar ancestries to PAGE. At current training population sample sizes, performance between baseline and fine-mapped protein prediction models in PWAS was similar, highlighting the utility of elastic net. Our predictive models in diverse populations are publicly available for use in proteome mapping methods at https://doi.org/10.5281/zenodo.4837327.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Protein prediction performance in TOPMed MESA populations.**
A. Distributions of prediction performance across proteins within each training population between modeling strategies. ρ is the Spearman correlation between predicted and observed protein abundance in the cross-validation. Fine-mapping prior to elastic net modeling produces more significant (ρ > 0.1, vertical dotted line) protein prediction models than baseline elastic net. B. Significant (ρ > 0.1, p < 0.05) protein model counts compared to population sample size colored by modeling strategy. TOPMed MESA populations: CHN, Chinese; AFA, African American; HIS, Hispanic/Latino; EUR, European; ALL, all populations combined.
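
The performance metric in this caption, Spearman ρ between predicted and observed abundance, can be illustrated with a minimal sketch. The function names and the toy abundance values below are hypothetical and are not taken from the study; the code only shows how ρ is obtained by ranking both vectors (ties share the average rank) and computing the Pearson correlation of the ranks, and how a model would be compared to the ρ > 0.1 cutoff.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(observed, predicted):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(observed), average_ranks(predicted))


# Hypothetical held-out abundances for one protein (illustrative values only).
observed = [0.2, 1.5, -0.3, 0.9, 2.1]
predicted = [0.1, 1.1, 0.0, 1.3, 1.9]
rho = spearman_rho(observed, predicted)
print(rho, rho > 0.1)  # the caption's cutoff for a significant model is rho > 0.1
```

In a cross-validation setting, the predicted values would come from held-out folds, so ρ reflects out-of-sample performance rather than fit to the training data.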

**Fig 2. TOPMed MESA protein prediction model performance comparison in the independent INTERVAL population.**
Within each training population, the fine-mapped model performance in INTERVAL (y-axis) is compared to the baseline elastic net model performance in INTERVAL (x-axis). Each dot represents a protein that is predicted by both baseline models and fine-mapped models. Performance was measured as the Spearman ρ between the measured protein aptamer level and the predicted protein aptamer level. Fine-mapped models performed better than baseline models in AFA (Wilcoxon signed-rank test, p = 0.0016) and CHN (p = 0.036), were not significantly different in EUR (p = 0.74) and HIS (p = 0.54), and significantly worse in ALL (p = 0.0085). TOPMed MESA populations: AFA, African American; ALL, all populations combined; CHN, Chinese; EUR, European; HIS, Hispanic/Latino.
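
The paired comparison described in this caption can be sketched as a two-sided Wilcoxon signed-rank test. The version below is a simplified illustration, not the authors' code: it uses the normal approximation without a continuity correction, drops zero differences, and gives tied absolute differences their average rank. The ρ values are hypothetical, and a statistics package would normally be used instead, possibly with an exact p-value for small samples.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, p_value), where w_plus is the sum of ranks of the
    positive differences y - x. Zero differences are dropped.
    """
    diffs = [b - a for a, b in zip(x, y) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected variance
    z = (w_plus - mean) / sqrt(var)
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return w_plus, p_value


# Hypothetical INTERVAL rho values for six proteins predicted by both strategies.
baseline_rho = [0.12, 0.30, 0.25, 0.41, 0.18, 0.22]
finemapped_rho = [0.15, 0.28, 0.33, 0.47, 0.30, 0.21]
w_plus, p = wilcoxon_signed_rank(baseline_rho, finemapped_rho)
print(w_plus, p)
```

Pairing by protein matters here: each protein contributes one difference, so the test asks whether fine-mapping shifts performance for the same proteins rather than comparing two unrelated distributions.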

**Fig 3. Protein prediction performance between training populations within each model building strategy.**
We compare the performance of TOPMed MESA ALL and EUR training populations in the INTERVAL study, a European population. For each model building strategy we first take the intersection of proteins that are predicted by both training populations and then test for differences in the distributions of Spearman correlation (ρ) by a Wilcoxon signed-rank test. INTERVAL ρ was significantly higher when we used the ALL training population in both our baseline (p = 0.0012) and fine-mapped (p = 0.0064) modeling strategies. (A) The distributions of INTERVAL ρ are plotted in each training population and modeling strategy. (B) The pairwise performance comparisons between ALL and EUR training populations are shown; each point represents a protein. The blue contour lines from two-dimensional kernel density estimation help visualize where the points are concentrated.

**Fig 4. Allele frequency differences lead to protein predictive performance differences between populations.**
Comparison of mean F_ST differences between protein models with large (> t) and small (≤ t) differences in predictive performance ρ in INTERVAL. For baseline models, protein groups with the larger absolute value ρ difference between TOPMed MESA training populations had significantly larger mean F_ST at each difference threshold, t (Wilcoxon rank sum tests, p < 3.1 × 10⁻¹⁰). For fine-mapped models, the differences between protein groups were attenuated, but still significant when t = 0.1 (p = 0.0028) and t = 0.2 (p = 0.010).

**Fig 5. Predicted protein-trait association results summary.**
(A) Bonferroni significant (baseline p < 1.54 × 10⁻⁶; fine-mapped p < 7.60 × 10⁻⁷) protein-trait association counts when we applied S-PrediXcan to 28 traits in PAGE using protein prediction models from each TOPMed MESA population and model building strategy. (B) Protein-trait pairs from A that also have a COLOC colocalization probability > 0.5. (C) Protein-trait pairs from B that replicate (baseline p < 1.54 × 10⁻⁶; fine-mapped p < 9.59 × 10⁻⁷) in independent studies from the UK Biobank or other large, European ancestries cohorts. Bonferroni threshold for fine-mapped models is calculated separately from the Bonferroni threshold for baseline models.
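
Because the caption computes the Bonferroni threshold separately for each modeling strategy, a small sketch may help show the arithmetic. It assumes a family-wise α of 0.05, which the caption does not state, and the protein-trait pairs and p-values are hypothetical. The threshold is simply α divided by the number of tests in the family, so a strategy that yields more testable models gets a stricter cutoff.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls the family-wise error at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_pairs(p_values, alpha=0.05):
    """Keep the protein-trait pairs whose p-value is below the Bonferroni threshold.

    p_values maps (protein, trait) -> p-value for ONE modeling strategy, so the
    number of tests, and therefore the threshold, is specific to that strategy.
    """
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {pair: p for pair, p in p_values.items() if p < threshold}


# Hypothetical results for one strategy (names and values are illustrative only).
baseline_results = {
    ("protein_A", "trait_1"): 1e-9,
    ("protein_B", "trait_1"): 0.02,
    ("protein_A", "trait_2"): 0.004,
    ("protein_C", "trait_3"): 0.3,
}
print(bonferroni_threshold(0.05, len(baseline_results)))
print(sorted(significant_pairs(baseline_results)))
```

Running the same function on each strategy's own result set reproduces the caption's design choice of strategy-specific thresholds rather than one pooled cutoff.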

**Fig 6. Distribution of adjusted protein abundance.**
We observe a linear association between *APOE* genotype and mean abundance of each Apo E isoform. Note that within a genotype, the target isoforms from the SOMAscan assay do not vary, indicating epitope cross-reactivity effects are likely. Top: Association in TOPMed ALL β = 0.498, p = 4.60 × 10⁻²⁷. Bottom: Association in INTERVAL β = 0.295, p = 1.98 × 10⁻³⁵. Only two isoforms were available in the INTERVAL dataset.
