Nat Methods. 2022 Dec;19(12):1599-1611.
doi: 10.1038/s41592-022-01640-x. Epub 2022 Oct 27.

A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies

Zilin Li, Xihao Li, Hufeng Zhou, Sheila M Gaynor, Margaret Sunitha Selvaraj, Theodore Arapoglou, Corbin Quick, Yaowu Liu, Han Chen, Ryan Sun, Rounak Dey, Donna K Arnett, Paul L Auer, Lawrence F Bielak, Joshua C Bis, Thomas W Blackwell, John Blangero, Eric Boerwinkle, Donald W Bowden, Jennifer A Brody, Brian E Cade, Matthew P Conomos, Adolfo Correa, L Adrienne Cupples, Joanne E Curran, Paul S de Vries, Ravindranath Duggirala, Nora Franceschini, Barry I Freedman, Harald H H Göring, Xiuqing Guo, Rita R Kalyani, Charles Kooperberg, Brian G Kral, Leslie A Lange, Bridget M Lin, Ani Manichaikul, Alisa K Manning, Lisa W Martin, Rasika A Mathias, James B Meigs, Braxton D Mitchell, May E Montasser, Alanna C Morrison, Take Naseri, Jeffrey R O'Connell, Nicholette D Palmer, Patricia A Peyser, Bruce M Psaty, Laura M Raffield, Susan Redline, Alexander P Reiner, Muagututi'a Sefuiva Reupena, Kenneth M Rice, Stephen S Rich, Jennifer A Smith, Kent D Taylor, Margaret A Taub, Ramachandran S Vasan, Daniel E Weeks, James G Wilson, Lisa R Yanek, Wei Zhao, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, TOPMed Lipids Working Group, Jerome I Rotter, Cristen J Willer, Pradeep Natarajan, Gina M Peloso, Xihong Lin

Zilin Li et al. Nat Methods. 2022 Dec.

Abstract

Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare-variant (RV) associations with complex human diseases and traits. Variant-set analysis is a powerful approach to study RV association. However, existing methods have limited ability in analyzing the noncoding genome. We propose a computationally efficient and robust noncoding RV association detection framework, STAARpipeline, to automatically annotate a whole-genome sequencing study and perform flexible noncoding RV association analysis, including gene-centric analysis and fixed window-based and dynamic window-based non-gene-centric analysis by incorporating variant functional annotations. In gene-centric analysis, STAARpipeline uses STAAR to group noncoding variants based on functional categories of genes and incorporate multiple functional annotations. In non-gene-centric analysis, STAARpipeline uses SCANG-STAAR to incorporate dynamic window sizes and multiple functional annotations. We apply STAARpipeline to identify noncoding RV sets associated with four lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several of them in an additional 9,123 TOPMed samples. We also analyze five non-lipid TOPMed traits.

Conflict of interest statement

Competing interests

S.M.G. is now an employee of Regeneron Genetics Center. J.B.M. is an Academic Associate for Quest Diagnostics R&D. For B.D.M.: The Amish Research Program receives partial support from Regeneron Pharmaceuticals. M.E.M. reports a grant from Regeneron Pharmaceuticals, unrelated to the present work. B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. L.M.R. is a consultant for the TOPMed Administrative Coordinating Center (through Westat). For S.R.: Jazz Pharma, Eli Lilly, Apnimed, unrelated to the present work. The spouse of C.J.W. works at Regeneron Pharmaceuticals. P.N. reports investigator-initiated grants from Amgen, Apple, AstraZeneca, Boston Scientific, and Novartis; personal fees from Apple, AstraZeneca, Blackstone Life Sciences, Foresite Labs, Novartis, and Roche / Genentech; is a co-founder of TenSixteen Bio; is a shareholder of geneXwell and TenSixteen Bio; and reports spousal employment at Vertex, all unrelated to the present work. X. Lin is a consultant for AbbVie Pharmaceuticals and Verily Life Sciences. The remaining authors declare no competing interests.

Figures

Extended Data Fig. 1. Rare variant (MAF < 0.01) distribution in the discovery phase using TOPMed cohorts (n = 21,015).
Variant categories are defined by GENCODE VEP categories.
Extended Data Fig. 2. Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of high-density lipoprotein cholesterol (HDL-C) in the discovery phase (n = 21,015).
a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 3.57 × 10⁻⁷. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/(20,000 × 7) = 3.57 × 10⁻⁷. Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 2.50 × 10⁻⁶. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/20,000 = 2.50 × 10⁻⁶. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of 1.88 × 10⁻⁸. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/(2.66 × 10⁶) = 1.88 × 10⁻⁸. f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosome numbers are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.
Extended Data Fig. 3. Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of low-density lipoprotein cholesterol (LDL-C) in the discovery phase (n = 21,015).
a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 3.57 × 10⁻⁷. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/(20,000 × 7) = 3.57 × 10⁻⁷. Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 2.50 × 10⁻⁶. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/20,000 = 2.50 × 10⁻⁶. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of 1.88 × 10⁻⁸. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/(2.66 × 10⁶) = 1.88 × 10⁻⁸. f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosome numbers are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.
Extended Data Fig. 4. Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of triglycerides (TG) in the discovery phase (n = 21,015).
a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 3.57 × 10⁻⁷. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/(20,000 × 7) = 3.57 × 10⁻⁷. Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 2.50 × 10⁻⁶. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/20,000 = 2.50 × 10⁻⁶. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of 1.88 × 10⁻⁸. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/(2.66 × 10⁶) = 1.88 × 10⁻⁸. f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosome numbers are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.
Extended Data Fig. 5. Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of total cholesterol (TC) in the discovery phase (n = 21,015).
a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 3.57 × 10⁻⁷. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/(20,000 × 7) = 3.57 × 10⁻⁷. Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of 2.50 × 10⁻⁶. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/20,000 = 2.50 × 10⁻⁶. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of 1.88 × 10⁻⁸. The significance threshold is defined by the Bonferroni correction for multiple comparisons: 0.05/(2.66 × 10⁶) = 1.88 × 10⁻⁸. f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosome numbers are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.
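The three genome-wide thresholds quoted in the captions for Extended Data Figs. 2 to 5 all follow the same Bonferroni arithmetic: a family-wise error rate of 0.05 divided by the number of tests. The short Python sketch below reproduces that arithmetic for illustration only. The test counts (20,000 genes, 7 functional categories, 2.66 × 10⁶ windows) are taken from the captions; the function and variable names are hypothetical and are not part of STAARpipeline.

```python
# Illustrative check of the Bonferroni thresholds quoted in the captions.
# Names here are hypothetical; this is not STAARpipeline code.

ALPHA = 0.05  # family-wise error rate used in the captions


def bonferroni_threshold(n_tests: float, alpha: float = ALPHA) -> float:
    """Return the per-test significance threshold for n_tests comparisons."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Test counts as stated in the captions.
protein_coding = bonferroni_threshold(20_000 * 7)  # 20,000 genes x 7 categories
ncrna = bonferroni_threshold(20_000)               # 20,000 tests
sliding_window = bonferroni_threshold(2.66e6)      # 2.66 x 10^6 2-kb windows

for label, value in [
    ("protein-coding genes", protein_coding),
    ("ncRNA genes", ncrna),
    ("2-kb sliding windows", sliding_window),
]:
    print(f"{label}: {value:.2e}")
```

Rounded to three significant figures, the printed values match the caption thresholds (3.57e-07, 2.50e-06 and 1.88e-08).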
Fig. 1. Workflow of STAARpipeline.
(a) Prepare the input data of STAARpipeline, including genotypes, phenotypes and covariates. (b) Annotate all variants in the genome using FAVORannotator through the FAVOR database and calculate the (sparse) genetic relatedness matrix. (c) Define analysis units in the noncoding genome: eight functional categories of regulatory regions, sliding windows and dynamic windows using SCANG. (d) Obtain genome-wide significant associations and perform analytical follow-up via conditional analysis.
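Step (c) of the workflow defines analysis units, one kind being fixed-size sliding windows. As a rough illustration of that idea, the Python sketch below assigns rare variants to overlapping 2-kb windows. It is not the STAARpipeline implementation. The 2-kb window size and the MAF < 0.01 cutoff come from the figure captions. The 1-kb step, the exclusion of variants with MAF of 0, the 0-based half-open coordinates and all function and variable names are assumptions made for this example.

```python
# Minimal sketch: group rare variants into overlapping fixed-size windows.
# Not STAARpipeline code; step size and names are assumptions for illustration.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE = 2_000  # 2-kb windows, as in the captions
STEP = 1_000         # ASSUMPTION: step size is not stated in this text
MAF_CUTOFF = 0.01    # rare variants: MAF < 0.01, as in the captions


def assign_windows(
    variants: Iterable[Tuple[int, float]],
    window_size: int = WINDOW_SIZE,
    step: int = STEP,
    maf_cutoff: float = MAF_CUTOFF,
) -> Dict[int, List[int]]:
    """Map window start -> positions of rare variants in [start, start + window_size).

    `variants` holds (position, minor allele frequency) pairs on one chromosome.
    Window starts are multiples of `step`, beginning at 0 (an assumed convention).
    """
    windows: Dict[int, List[int]] = defaultdict(list)
    for pos, maf in variants:
        # ASSUMPTION: monomorphic sites (MAF == 0) are skipped along with common variants.
        if not 0 < maf < maf_cutoff:
            continue
        last_start = (pos // step) * step
        # Smallest multiple of `step` strictly greater than pos - window_size.
        first_start = max(0, ((pos - window_size) // step + 1) * step)
        for start in range(first_start, last_start + 1, step):
            windows[start].append(pos)
    return dict(windows)


example = [(500, 0.002), (1_500, 0.2), (2_500, 0.005)]
result = assign_windows(example)
print(result)  # {0: [500], 1000: [2500], 2000: [2500]}
```

The common variant at position 1,500 is filtered out, and the variant at 2,500 falls into two overlapping windows, which is why neighbouring sliding-window tests are correlated.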

