Nat Methods. 2022 Dec;19(12):1599-1611.
doi: 10.1038/s41592-022-01640-x. Epub 2022 Oct 27.

A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies

Zilin Li, Xihao Li, Hufeng Zhou, Sheila M Gaynor, Margaret Sunitha Selvaraj, Theodore Arapoglou, Corbin Quick, Yaowu Liu, Han Chen, Ryan Sun, Rounak Dey, Donna K Arnett, Paul L Auer, Lawrence F Bielak, Joshua C Bis, Thomas W Blackwell, John Blangero, Eric Boerwinkle, Donald W Bowden, Jennifer A Brody, Brian E Cade, Matthew P Conomos, Adolfo Correa, L Adrienne Cupples, Joanne E Curran, Paul S de Vries, Ravindranath Duggirala, Nora Franceschini, Barry I Freedman, Harald H H Göring, Xiuqing Guo, Rita R Kalyani, Charles Kooperberg, Brian G Kral, Leslie A Lange, Bridget M Lin, Ani Manichaikul, Alisa K Manning, Lisa W Martin, Rasika A Mathias, James B Meigs, Braxton D Mitchell, May E Montasser, Alanna C Morrison, Take Naseri, Jeffrey R O'Connell, Nicholette D Palmer, Patricia A Peyser, Bruce M Psaty, Laura M Raffield, Susan Redline, Alexander P Reiner, Muagututi'a Sefuiva Reupena, Kenneth M Rice, Stephen S Rich, Jennifer A Smith, Kent D Taylor, Margaret A Taub, Ramachandran S Vasan, Daniel E Weeks, James G Wilson, Lisa R Yanek, Wei Zhao, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, TOPMed Lipids Working Group, Jerome I Rotter, Cristen J Willer, Pradeep Natarajan, Gina M Peloso, Xihong Lin

Zilin Li et al. Nat Methods. 2022 Dec.

Abstract

Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare-variant (RV) associations with complex human diseases and traits. Variant-set analysis is a powerful approach to study RV association. However, existing methods have limited ability in analyzing the noncoding genome. We propose a computationally efficient and robust noncoding RV association detection framework, STAARpipeline, to automatically annotate a whole-genome sequencing study and perform flexible noncoding RV association analysis, including gene-centric analysis and fixed window-based and dynamic window-based non-gene-centric analysis by incorporating variant functional annotations. In gene-centric analysis, STAARpipeline uses STAAR to group noncoding variants based on functional categories of genes and incorporate multiple functional annotations. In non-gene-centric analysis, STAARpipeline uses SCANG-STAAR to incorporate dynamic window sizes and multiple functional annotations. We apply STAARpipeline to identify noncoding RV sets associated with four lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several of them in an additional 9,123 TOPMed samples. We also analyze five non-lipid TOPMed traits.

Conflict of interest statement

Competing interests

S.M.G. is now an employee of Regeneron Genetics Center. J.B.M. is an Academic Associate for Quest Diagnostics R&D. For B.D.M.: The Amish Research Program receives partial support from Regeneron Pharmaceuticals. M.E.M. reports a grant from Regeneron Pharmaceuticals unrelated to the present work. B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. L.M.R. is a consultant for the TOPMed Administrative Coordinating Center (through Westat). For S.R.: Jazz Pharma, Eli Lilly, Apnimed, unrelated to the present work. The spouse of C.J.W. works at Regeneron Pharmaceuticals. P.N. reports investigator-initiated grants from Amgen, Apple, AstraZeneca, Boston Scientific, and Novartis; personal fees from Apple, AstraZeneca, Blackstone Life Sciences, Foresite Labs, Novartis, and Roche / Genentech; is a co-founder of TenSixteen Bio; is a shareholder of geneXwell and TenSixteen Bio; and reports spousal employment at Vertex, all unrelated to the present work. X. Lin is a consultant of AbbVie Pharmaceuticals and Verily Life Sciences. The remaining authors declare no competing interests.

Figures

Extended Data Fig. 1 | Rare variant (MAF < 0.01) distribution in the discovery phase using TOPMed cohorts (n = 21,015).
Variant categories are defined by GENCODE VEP categories.
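The MAF < 0.01 cutoff in this caption can be illustrated with a minimal sketch. It assumes biallelic variants coded as alternate-allele counts (0, 1 or 2) per sample with no missing genotypes, and it excludes monomorphic variants; the function names and the toy data are hypothetical and are not taken from STAARpipeline.

```python
from typing import List, Sequence


def minor_allele_frequency(dosages: Sequence[int]) -> float:
    """MAF of one biallelic variant from per-sample alternate-allele counts."""
    if not dosages:
        raise ValueError("need at least one genotype")
    # Each diploid sample carries two alleles, hence the factor of 2.
    alt_freq = sum(dosages) / (2 * len(dosages))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def rare_variant_indices(
    genotypes: Sequence[Sequence[int]], maf_cutoff: float = 0.01
) -> List[int]:
    """Indices of variants with 0 < MAF < maf_cutoff.

    genotypes[j] holds the dosages of variant j across all samples.
    Dropping monomorphic variants (MAF == 0) is an assumption of this sketch.
    """
    return [
        j
        for j, dosages in enumerate(genotypes)
        if 0.0 < minor_allele_frequency(dosages) < maf_cutoff
    ]


# Toy data: 200 samples, four variants (hypothetical values).
n_samples = 200
genotypes = [
    [1] + [0] * (n_samples - 1),   # one heterozygote: MAF = 1/400
    [1] * 50 + [0] * 150,          # common variant: MAF = 50/400
    [0] * n_samples,               # monomorphic: MAF = 0
    [2] * (n_samples - 1) + [1],   # reference allele is the minor allele
]

rare = rare_variant_indices(genotypes)
print(rare)
```

Taking the minimum of the two allele frequencies matters for the last toy variant, where the alternate allele is the common one and the reference allele is rare.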
Extended Data Fig. 2 | Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of high-density lipoprotein cholesterol (HDL-C) in the discovery phase (n = 21,015).
a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 3.57 × 10⁻⁷. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/(20,000 × 7) = 3.57 × 10⁻⁷. Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 2.50 × 10⁻⁶. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/20,000 = 2.50 × 10⁻⁶. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of 1.88 × 10⁻⁸. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/(2.66 × 10⁶) = 1.88 × 10⁻⁸. f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosome numbers are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.
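The three significance thresholds quoted in this caption each come from dividing a family-wise error rate of 0.05 by the number of tests. The short sketch below reproduces that arithmetic, using the test counts stated in the caption (20,000 genes, 7 functional categories, 2.66 million windows); the function and variable names are illustrative only.

```python
def bonferroni_threshold(alpha: float, n_tests: float) -> float:
    """Per-test significance threshold that controls the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


ALPHA = 0.05          # family-wise error rate
N_GENES = 20_000      # approximate gene count used in the caption
N_CATEGORIES = 7      # functional categories tested per protein-coding gene
N_WINDOWS = 2.66e6    # number of 2-kb sliding windows

thresholds = {
    # Each protein-coding gene is tested once per functional category.
    "protein_coding": bonferroni_threshold(ALPHA, N_GENES * N_CATEGORIES),
    # Each ncRNA gene is tested once.
    "ncRNA": bonferroni_threshold(ALPHA, N_GENES),
    # Each sliding window is tested once.
    "sliding_window": bonferroni_threshold(ALPHA, N_WINDOWS),
}

for name, value in thresholds.items():
    print(f"{name}: {value:.2e}")
```

The protein-coding threshold is the most stringent of the two gene-centric thresholds because every gene contributes seven tests rather than one.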
Extended Data Fig. 3 | Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of low-density lipoprotein cholesterol (LDL-C) in the discovery phase (n = 21,015).
a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 3.57 × 10⁻⁷. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/(20,000 × 7) = 3.57 × 10⁻⁷. Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 2.50 × 10⁻⁶. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/20,000 = 2.50 × 10⁻⁶. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of 1.88 × 10⁻⁸. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/(2.66 × 10⁶) = 1.88 × 10⁻⁸. f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosome numbers are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.
Extended Data Fig. 4 | Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of triglycerides (TG) in the discovery phase (n = 21,015).
a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 3.57 × 10⁻⁷. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/(20,000 × 7) = 3.57 × 10⁻⁷. Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 2.50 × 10⁻⁶. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/20,000 = 2.50 × 10⁻⁶. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of 1.88 × 10⁻⁸. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/(2.66 × 10⁶) = 1.88 × 10⁻⁸. f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosome numbers are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.
Extended Data Fig. 5 | Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of total cholesterol (TC) in the discovery phase (n = 21,015).
a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 3.57 × 10⁻⁷. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/(20,000 × 7) = 3.57 × 10⁻⁷. Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 2.50 × 10⁻⁶. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/20,000 = 2.50 × 10⁻⁶. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of 1.88 × 10⁻⁸. The significance threshold is defined by multiple comparisons using the Bonferroni correction 0.05/(2.66 × 10⁶) = 1.88 × 10⁻⁸. f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosome numbers are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.
Fig. 1 | Workflow of STAARpipeline.
(a) Prepare the input data of STAARpipeline, including genotypes, phenotypes and covariates. (b) Annotate all variants in the genome using FAVORannotator through the FAVOR database and calculate the (sparse) genetic relatedness matrix. (c) Define analysis units in the noncoding genome: eight functional categories of regulatory regions, sliding windows, and dynamic windows using SCANG. (d) Obtain genome-wide significant associations and perform analytical follow-up via conditional analysis.
